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ABSTRACT 

In  a  layered  feed-forward  network  the  error  surface  with  respect  to  a 
desired  goal  function  completely  describes  the  potential  of  that  network  to  carry 
out  a  particular  task.  Such  an  error  surface  can  be  obtained  by  plotting  the  max¬ 
imum  error  or  the  sum  of  the  squares  of  the  errors,  where  the  individual  errors  are 
the  deviation  of  the  actual  output  of  the  network  from  the  desired  goal  function. 
For  a  network  with  W  individual  (synaptic)  weights,  these  error  surfaces  are  W- 
dimensional  surfaces  embedded  in  a  W+l  -dimensional  space  and  defined  over 
the  IF -dimensional  domain  determined  by  the  ranges  of  the  W  weights. 

We  have  studied  in  detail  the  error  surfaces  of  a  few  small  networks  that  we 
believe  exhibit  some  of  the  typical  characteristics  also  found  in  large  networks. 
For  the  chosen  networks  we  have  determined  the  inherent  symmetries  of  their 
associated  error  surfaces,  the  shape  of  the  region  around  the  origin,  and  the 
numbers  and  types  of  local  minima.  For  each  type  of  local  minimum,  we  exam¬ 
ined  its  multiplicity,  the  fractional  coverage  of  the  total  domain  space  by  its  col¬ 
lection  zone,  and  the  approximate  shape  of  the  valleys  leading  into  it.  This  study 
of  small  three-layered  networks  performing  the  Exclusive-OR  function  of  two  or 
three  inputs  provides  some  insight  into  the  general  structures  of  the  error  surfaces 
and  thus  the  capabilities  of  more  complicated  networks. 
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1.  MOTIVATION 

A  neural  network  is  a  system  that  achieves  computational  power  via  a  massively  parallel 
network  composed  of  many  simple  computational  elements  arranged  in  patterns  similar  to  those 
found  in  some  biological  neural  systems.  Computational  elements  are  connected  by  links  that 
can  vary  in  strengths  called  weights .  The  patterns  of  connections,  which  are  called  architec¬ 
tures  ,  have  a  great  effect  on  the  computational  power  of  the  network.  Networks  with  different 
architectures  have  different  computation  capabilities.  Although,  we  have  some  knowledge  about 
such  networks,  their  real  potential  will  remain  a  conjecture  until  their  characteristics  are  better 
understood. 

One  example  of  an  architecture  is  the  layered  feed-forward  network.  This  class  of  networks 
has  recently  met  with  great  interest  because  of  its  ability  to  represent  complex  mappings.  More¬ 
over,  it  is  claimed  that  this  kind  of  network  is  able  to  capture  regularities  in  the  specified  mapping 
[1],  Specifically,  if  there  is  a  regularity  in  the  mapping,  these  networks  can  exploit  this  regularity 
to  generate  the  correct  output  patterns  for  input  patterns  they  have  never  encountered  before;  a 
task  called  generalization .  The  capability  of  a  layered  feed-forward  network  to  represent  a  par¬ 
ticular  mapping  is  completely  described  by  the  corresponding  error  surface .  The  error  surface, 
usually  high  dimensional,  measures  the  deviation  between  the  desired  mapping  and  the  actual 
mapping  performed  by  the  network  for  all  possible  combinations  of  link  strengths,  i.e.,  values  of 
its  weights.  Understanding  the  structure  of  the  error  surface  will  help  us  answer  many  questions 
about  the  network’s  properties. 

The  goal  of  this  report  is  to  increase  understanding  of  layered  feed-forward  networks  by 
studying  their  error  surfaces.  We  also  wish  to  present  new  visualization  techniques  for  high 
dimensional  error  surfaces.  We  have  chosen  to  concentrate  on  problems  concerning  error  sur¬ 
faces  associated  with  the  networks  performing  the  exclusive-or  (XOR)  function  and  the  parity 
function  of  three  inputs.  These  are  among  the  simplest  problems  that  cannot  be  solved  with  a 
simple  two-layer  network.  For  these  cases,  the  error  surfaces  are  mapped  out  and  presented  in  a 
form  that  are  intuitively  understandable.  We  believe  that  these  simple  studies  provide  insight 
into  the  structure  of  the  error  surfaces  of  more  complicated  neural  networks. 
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2.  LAYERED  FEED-FORWARD  NETWORKS 

A  layered  feed-forward  network  is  a  type  of  neural  network  that  consists  of  layers  of  simple 
computational  elements.  These  computational  elements,  referred  to  as  neurons ,  are  connected  by 
links  with  variable  weights. 

In  a  layered  feed-forward  network  only  neurons  belonging  to  different  layers  are  connected 
by  links.  The  first,  or  bottom,  layer  is  an  input  layer.  The  neurons  in  the  input  layer  are  not 
actual  computational  elements.  Their  task  is  simply  to  provide  an  impedance  match  to  the  input 
signals.  The  last,  or  top,  layer  is  an  output  layer.  Every  observable  output  variable  is  associated 
with  one  output  neuron.  Between  the  input  and  the  output  layers  there  can  be  any  number  of  hid¬ 
den  layers.  Each  neuron  in  the  hidden  layers  must  receive  its  input  signal  from  the  neurons  in  the 
layers  lower  than  its  own  and  must  send  the  output  signal  to  neurons  in  layers  higher  than  its 
own.  When  given  an  input  vector,  the  network  computes  its  output  by  a  forward  pass.  In  this 
event  all  neurons  in  each  layer  receive  vector  signals  from  lower  layers  and  compute  outputs  for 
subsequent  layers. 

Outputs 

Output  Layer 


Hidden  Layer 


Input  Layer 

Inputs 

Fig.  1  A  Network  with  Complete  Connection  between  Adjacent  Layers 

Computational  elements,  or  neurons,  are  usually  non-linear  analog  processors  capable  of 
performing  simple  tasks,  such  as  summing,  thresholding,  scaling.  A  typical  model  of  a  neuron  is 
a  processor  that  forms  the  weighted  sum  of  all  its  input  signals,  adds  a  bias  term  to  the  sum,  and 
then  passes  the  result  through  a  switching  function  as  shown  in  Fig  2.  The  value  of  the  function 
becomes  the  output  of  the  neuron. 

The  particular  networks  considered  in  this  report  belongs  to  a  class  of  three-layer  networks 
with  complete  connectivity  between  the  adjacent  layers.  A  sample  of  this  class  of  network  is 
illustrated  in  Figure  3.  The  network  is  composed  of  three  layers  of  neurons  :  an  input  layer  (LI), 
a  hidden  layer  (L2),  and  an  output  layer  (L3).  Neurons  in  one  layer  are  connected  only  to  neurons 
in  the  subsequent  layer.  That  is,  the  neurons  in  LI  are  connected  only  to  neurons  in  L2  but  are 
not  connected  directly  to  neurons  in  L3. 


Hard  Limiter 
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Fig.  2  Various  Types  of  Switching  Functions 
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Fig.  3  A  Sample  of  the  Class  of  Networks  Studied  in  this  Report. 

To  see  how  the  neurons  in  this  class  of  networks  compute  their  output,  let  us  consider  neu¬ 
ron  o  and  neurons  h  i,  h2>  •••  ,  hn  as  shown  in  Fig.  4. a.  In  this  figure,  neurons  h\,  h2  >  •••  <  hn  pro¬ 
vide  input  signals  to  neuron  o  through  connections  with  strengths  of  w\,  w2,  ...  ,  w„.  When  the 
neuron  o  receives  inputs  i  1,  i2,  ...  ,  in<  which  are  the  outputs  of  the  neurons  h  1,  h 2,  ...  ,  and  hn 
respectively,  it  forms  the  weighted  sum  of  all  the  incoming  signals  by  multipling  each  input  sig¬ 
nal  with  its  corresponding  weight  and  adding  the  products.  Also,  a  bias  term  wn+ 1  is  added  to  the 
weighted  sum,  and  the  new  sum  is  passed  through  the  sigmoid  function  such  as  that  demonstrated 
in  Fig.  4.b.  This  sigmoid  function  has  been  chosen  to  be  symmetric  with  respect  to  the  origin  in 
order  not  to  conceal  any  symmetry  inherent  in  the  topology  of  the  network  or  in  the  specified 
mapping.  Also,  we  normalize  the  value  of  the  function  so  that  the  output  of  a  neuron  is  in  the 
range  of  [-1,1]. 
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/  (  W\h  1  +Wf/t2+  •"  +  Wnhn  +  Wn+1  )  /  (°0 


Fig.  4.a  Input-Output  Relation  of  a  Neuron  Fig.  4.b  Sigmoid  Function 
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3.  MATHEMATICAL  DEFINITIONS 

3.1.  Input-Output  Relation  of  a  Layered  Feed-Forward  Network 

Out 


Fig.  5  A  Network  with  Two  Input  Neurons  and  Two  Hidden  Neurons 


When  an  input  is  presented  to  a  layered  feed-forward  network,  the  signals  are  propagated 
from  the  input  layer  through  hidden  layers  to  the  output  layer.  To  find  the  input-output  relation 
of  a  layered  feed- forward  network,  let  us  consider  the  example  shown  in  Figure  5.  This  network 
consists  of  two  input  neurons,  two  hidden  neurons  and  one  output  neuron.  In  all  but  the  input 
neurons,  the  sigmoid  function  illustrated  in  Figure  4.b  is  used  as  a  switching  function.  These  five 
neurons  in  the  network  are  connected  by  nine  adjustable  weights  labeled  according  to  Figure  5. 
The  values  of  these  weights  can  be  represented  by  an  ordered  nine-tuple  of  real  numbers  (wi,w2 

,  ...  ,  wg,  w9 )  and  be  conveniently  referred  to  as  a  nine-dimensional  point  or  a  vector  with  nine 

components.  Let  us  denote  an  input  vector  (  i ii )  by  i,  and  a  vector  ( w\,  vv2  ,  ...  ,  wg,  w9 ) 
representing  a  set  of  the  weights  of  the  network  by  w.  The  output  values  of  the  hidden  neurons, 
h  i  and  h  2  can  be  computed  by  the  following  equations  : 

Al(W,i)  =  SWf  (  VV  4  +  W  5i  j  +  Wgi  2),  (1) 

/i2(w,  i)  =  swf(  w7  +  wgi1  +  W9/2).  (2) 

where  swf  is  the  sigmoid  switching  function.  The  output  neuron  uses  h  \  and  /i 2  to  compute  its 
output  ( 0 1 ),  which  is  the  output  of  the  network,  according  to  the  following  equation: 

c>i(w,  i)  =  w/(  vv7 +  wg/[  +  w9i2).  (3) 

Equations  1,  2  and  3  define  the  input-output  relation  of  the  network  for  a  given  set  of  weights  w. 
This  overall  input-output  relation  of  the  network  can  be  summarized  into  one  equation: 

0  =  O  (  w,  i ).  (4) 

For  this  particular  network,  the  input-output  relation  is  a  function  from  R2  to  R  with  w  as  a 
vector-valued  parameter. 
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For  a  network  with  m  input  neurons  and  n  output  neurons,  an  input  vector  i  becomes  a  vec¬ 
tor  with  m  components  and  an  output  vector  o  becomes  a  vector  with  n  components.  We  call  the 
set  containing  all  possible  input  vectors  an  input  space  and  the  set  containing  all  possible  output 
vectors  an  output  space .  In  this  case,  the  input  space  is  a  subset  of  Rm  and  the  output  space  is  a 
subset  of  Rn .  Furthermore,  if  we  assume  that  the  network  consists  of  k  weights,  then  w  becomes 
a  vector  with  k  components.  We  call  the  set  containing  all  possible  w  a  weight  space .  When  we 
relate  equation  (4)  to  this  case,  the  input-output  relation  of  the  network  will  be  a  function  from  an 
input  space  to  an  output  space  with  w,  a  vector  in  a  weight  space,  as  a  parameter. 


3.2.  Mapping  Representability  of  a  Network 

As  the  previous  section  mentions,  the  input-output  relation  of  a  layered  feed-forward  net¬ 
work  is  a  function  whose  mapping  rule  depends  on  the  connection  weights  of  the  network.  That 
is,  by  altering  the  weights,  the  network  can  be  used  to  implement  different  mappings.  Suppose 
we  want  the  network  to  perform  the  goal— mapping  G ,  which  is  defined  as  {  (  ij,  gi ),  (  h.  g2  )> 

.  ( ‘x  <  g«  )  } ,  where  ( \p ,  gp  )  are  respective  the  input-output  pair.  If  we  restrict  the  input  space  of 
the  network  to  the  domain  of  G ,  denoted  as  D  (G ),  the  restricted  input-output  relation  of  the  net¬ 
work  becomes  a  mapping  with  the  domain  equal  to  D{G )  and  with  the  mapping  rule  O  (w,  ). 
We  call  the  mapping  produced  by  the  network  for  this  restricted  input  set  an  actual  mapping  and 
we  denote  it  as  A  (w).  By  comparing  A  (w)  to  mapping  G  ,  we  can  see  how  well  the  network  per¬ 
forms  a  specified  task  when  it  is  assigned  a  set  of  weights  w. 

As  an  alternative  to  describing  mapping  G  by  specifying  all  input-output  pairs,  let  us 
describe  mapping  G  by  specifying  its  domain  D  (G )  and  its  range  R  (G).  We  order  the  elements 
in  R  (G )  so  that  the  p  -th  element  of  R  (G )  is  an  image  of  the  p  -th  element  of  D  (G )  under  map¬ 
ping  G,  forp  e  1,2,  •  •  •  ,g.  According  to  this  method,  mapping  G  can  be  described  as  a  map¬ 
ping  with 

domain D(G)  =  {  it,  12.  <i*  )>  (5) 

and 

range  R(G)  =  {  gt,  g2.  "•  »g*  )• 

As  A  (w)  and  G  have  the  same  domain,  by  keeping  the  elements  of  the  domain  of  A  (w)  in  the 
same  order  as  those  of  D(G ),  we  can  describe  mapping  A  (w)  as  a  mapping  with 

domain  D  (A  (w))  =  {  it,  h,  •••  ,ig  },  (7) 


and 


range  R  (A  (w))  =  {  Oj,  02,  ,og  }.  (8) 

where  op=0(yv,ip).  Since  the  domains  of  G  and  A( w)  are  the  same,  we  can  define  a 
difference  mapping,  denoted  as  (  G  -  A  (w) ),  as  follows  : 

(G  -A(yf))(ip)  =  G(ip) - A(w,ip);  (9) 

for  every  \p  in  the  domain  of  G  and  A  (w).  This  new  mapping  has  the  same  domain  as  that  of  G 
or  of  A  (w)  and  the  range  which  is  described  as  : 

range  of  (  G  -  A  (w)  )  =  {  (  gi  -  Oi ),  (  g2  -  02 ),  '  '  .(  g*  -*>*)}• 


(10) 
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By  examining  the  magnitude  of  the  elements  in  the  range  of  (G  -A  (w)  ),  we  can  find  out  how 
well  the  network  with  weights  w  can  represent  mapping  G  .  For  example,  if  all  the  elements  in 
the  range  of  mapping  (G-A  (w))  are  zero,  this  function  will  map  every  element  in  its  domain  to  0. 
It  means  that  A  (w)  and  G  are  the  same,  and  in  this  case  the  network  with  weights  w  can  exactly 
represent  mapping  G .  On  the  contrary,  if  the  magnitudes  of  most  elements  in  the  range  of  map¬ 
ping  (G-A( w))  are  large,  A  (w)  and  G  will  be  are  quite  different.  This  implies  that  the  perfor¬ 
mance  of  the  network  with  weights  w  is  poor  in  representing  mapping  G . 

3.3.  Error  Function  and  Error  Surface 

Rather  than  using  the  range  of  mapping  (  G-A  (w) )  to  reflect  the  performance  of  the  pre¬ 
viously  discussed  network,  it  is  preferable  to  use  only  one  scalar  value  to  measure  the  perfor¬ 
mance  of  this  network  for  representing  mapping  G .  Therefore,  we  will  associate  the  range  of 
mapping  ( G  -  A  (w) )  with  a  given  norm  and  then  use  this  norm  as  a  performance  measure.  Sup¬ 
pose  the  network  has  n  output  neurons  and  mapping  G  consists  of  g  ordered  pairs.  This  implies 
that  there  are  also  g  ordered  pairs  in  mapping  (G  —  A (w) ).  According  to  the  mapping  descrip¬ 
tion  mentioned  above,  the  range  of  (  G  —  A  (w) )  must  have  g  elements  and  may  be  described  by 
equation  (10).  Reconsidering  this  equation,  let  us  substitute1  ( gp\  gf,  ■■■  ,gp)  for  gp  and 
(  opl ,  Op2,  ■  ■  • ,  o" )  for  Op .  The  range  of  (  G  -  A  (w) )  then  becomes 

{((W -W).(S? -of) . (tei  ~ol),(gz-ot) . (gl-ol)). 

■  •  (teg  - Og), teg ’ - °g)' •••» teg  - ojf»  } • 

Since  we  are  not  concerned  with  each  individual  vector,  we  can  rewrite  the  set  of  vectors 
specified  above  as  a  single  vector: 

(  (gt  -oh, ter -oh,-, tel -oV,  tel  -oh,tei -oh,-, tel -oh, 

•  ■  • .  teg  -  og ).  teg  -  oh,  -  -  teg  -  og)  )■ 

We  call  this  ordered  gn  -tuple  a  vector  representing  the  range  of  (  G  -  A  (w) ).  The  magnitudes  of 
the  elements  in  the  range  of  ( G  -  A  (w) )  reflect  the  difference  between  mapping  G  and  mapping 
A  (w).  They  are  summarized  in  the  norm  of  the  vector  representing  the  range  (  G  -  A  (w) ), 
which  can  thus  be  viewed,  intuitively,  to  represent  the  error  between  mapping  G  and  mapping 
A  (w).  We  thus  call  the  function  that  maps  w,  a  point  in  the  weight  space,  to  this  error  value  the 
error  f  unction ,  with  respect  to  the  goal  mapping. 

The  definition  of  an  error  function  depends  on  the  definition  of  the  vector  norm.  With  a 
Euclidean  norm,2  the  error  function  is  expressed  as 

l 

EucErr(w)  =  (  t  t  tej-Ote^h)^  )T  (1!) 

< =i  /=i 

However,  for  ease  of  computation  we  will  employ  the  square  of  the  Euclidean  norm  in  our  study 
cases.  Therefore,  our  error  function  is  defined  as 

TsqErr(w)  =  (  T  £  (g‘>  -  0‘  ( w,  iy-  )  )2  )  (12) 


1  We  use  a  superscript  to  indicate  output  neuron  number. 

2  For  vector  x  specified  as  ( xi,  x2,  ■  •  •  ,  x,  ),  the  Euclidean  norm  is  defined  as  ( xf  +  x}  +  •  •  •  +  x„2  )T. 


I 
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If  we  are  interested  in  the  maximum  error  made  for  any  of  the  elements  in  the  domain  of  G ,  a 
norm  defined  as  : 

=  max(  Ixi  1 ,  1x2 1 ,  ■  •  •  ,  Ixn  I  ).  (13) 

should  be  used.  This  gives  the  error  function  of  the  form  : 

MaxErr(w)  =  max  I  gj  -  Ol(  \v,  \j  )  I ,  (14) 

for  i  e  {  1,2,... ,  n  },  and  j  e  {  1,2 . g}. 

For  a  given  network  and  a  given  problem  statement,  an  error  function  that  maps  a  set  of 
weights  to  an  error  value  fully  describes  how  well  the  network  performs  for  all  possible  values  of 
its  weights.  Since  the  error  function  that  we  use  in  this  report  is  continuous,  we  can  use  this  func¬ 
tion  to  define  a  surface  called  an  error  surface  ?  By  studying  this  surface,  we  can  know  how 
well  the  network  can  be  used  to  represent  a  designated  mapping.  Unfortunately,  these  error  sur¬ 
faces  are  of  very  high  dimensions  and  thus  cannot  be  easily  perceived.  We  shall  now  discuss 
some  visualization  techniques  to  study  such  hypersurfaces. 


3  A  surface  of  dimension  n,  or  n  -surface,  is  a  set  in  R"+l  that  contains  all  the  points  of  the  form 
(c.  /(c)).  where  /  is  a  smooth  function  from  R"  — *  R  and  c  belongs  to  the  domain  of  /  . 
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4.  MODELING  ERROR  SURFACES 

For  a  layered  feed- forward  network  with  k  variable  weights,  an  error  surface  is  an  k- 
dimensional  manifold  embedded  in  an  k+\ -dimensional  space  and  defined  over  an  k  -dimensional 
space  given  by  the  weight  space.  This  error  surface  completely  describes  the  potential  of  the  net¬ 
work  for  its  desired  goal  mapping.  Therefore,  understanding  this  surface  will  help  us  answer  all 
the  questions  related  to  the  capability  of  the  associated  network  in  performing  a  designated  map¬ 
ping.  For  a  very  simple  network,  such  as  the  Perceptron  with  two  inputs,  error  surfaces  are  sur¬ 
faces  of  two  or  three  dimensions,  and  thus  can  be  displayed  with  regular  graphical  techniques. 
However,  for  a  more  complicated  system,  error  surfaces  are  usually  of  very  high  dimension 
where  a  simple  visualization  is  impossible.  Accordingly,  for  high-dimensional  surfaces  we  need 
to  consider  ways  other  than  those  of  modeling  the  surfaces.  One  alternative  approach  for  portray¬ 
ing  such  hyper-surfaces  is  to  present  them  in  conceptual  forms  that  reflect  significant  structures 
of  those  surfaces.  Exactly  what  significant  means  depends  on  the  particular  questions  that  con¬ 
cern  us. 

In  this  report,  we  have  chosen  to  concentrate  on  problems  concerning  error  surfaces  associ¬ 
ated  with  networks  performing  Boolean  functions.  Since  the  sigmoid  function  employed  as  a 
switching  in  our  neurons  has  its  output  value  defined  over  [  -1,  1  ],  we  use  the  value  of  1  to 
represent  the  on  state  and  the  value  of  -1  to  represent  the  off  state.  This  representation  also 
preserves  symmetries  inherent  in  the  mapping  statement  The  importance  of  these  symmetries 
will  be  fully  discussed  later. 

4.1.  Error  Surfaces  of  a  Perceptron  with  Two  Inputs 


Out 


Fig.  6  A  Perceptron  with  Two  Inputs 


For  a  layered  feed-forward  network  with  two  or  three  variable  weights,  the  error  surface  has 
two  or  three  dimensions,  and  thus  can  be  actually  modeled.  This  is  illustrated  by  the  error  sur¬ 
faces  associated  with  the  Perceptron  with  two  inputs,  shown  in  Figure  6.  This  system  has  three 
variable  weights;  therefore,  the  error  surfaces  affiliated  with  this  system  will  be  3-dimensional 
surfaces.  One  way  to  model  these  surfaces  is  by  displaying  a  family  of  contour  plots,  which  is 
demonstrated  in  Figure  7.  From  Figure  7.3,  which  illustrates  the  error  surface  of  the  Perceptron 
performing  the  XOR  mapping,  we  see  that  no  local  minimum  exists  deep  enough  to  constitute  a 
good  solution.  Therefore,  the  single  Perceptron  cannot  perform  the  XOR  mapping. 


\V3 
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4.2.  High-Dimensional  Error  Surfaces 

To  gain  some  general  insight  into  the  structures  of  high-dimensional  error  surfaces,  we  shall 
study  the  error  surfaces  associated  with  the  following  problems  : 

the  network  in  Figure  8.1  with  2  hidden  neurons  performing  the  XOR  function, 
the  network  in  Figure  8.2  with  3  hidden  neurons  performing  the  XOR  function, 
the  network  in  Figure  8.3  with  4  hidden  neurons  performing  the  XOR  function, 
the  network  in  Figure  8.4  with  3  hidden  neurons  performing  the  3-input  odd-parity 
function, 

In  these  study  cases,  we  shall  map  out  the  error  surfaces  and  then  portray  the  derived  infor¬ 
mation  in  terms  of  tables  and  graphs.  We  are  interested  in  the  potential  of  these  networks  to 
represent  the  designated  mappings  and  in  the  efficiency  of  the  networks  in  learning  to  perform 
such  a  mapping.  Our  conceptual  models  will  concentrate  on  the  following  information  . 

the  nature  of  the  error  surface  at  the  origin, 

the  number  of  different  types  of  minima  and  the  multiplicity  of  each, 

the  approximate  shapes  of  the  valleys  affiliated  with  the  various  minima,  and 

the  fractional  coverage  of  the  total  domain  space  by  the  collection  zone  of  each 

minimum.1 


Out 


Fig.  8.1  A  Network  with  Two  Input  Neurons  and  Two  Hidden  Neurons 


1  The  collection  zone  of  a  given  minimum  is  the  region  for  which  a  steepest  descent  ends  up  in  the  given 
minimum. 
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Out 


Fig.  8.2  A  Network  with  Two  Input  Neurons  and  Three  Hidden  Neurons 


Out 


Fig.  8.3  A  Network  with  Two  Input  Neurons  and  Four  Hidden  Neurons 
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Out 


1 1  *2  *3 

Fig.  8.4  A  Network  with  Three  Input  Neurons  and  Three  Hidden  Neurons 


-  16- 


5.  SYMMETRIES  IN  WEIGHT  SPACE 

An  error  surface,  defined  over  the  weight  space  of  a  network,  reflects  the  symmetries  of  the 
network  and  of  the  desired  mapping.  These  symmetries  are  very  useful  to  reduce  the  fraction  of 
the  surface  that  needs  to  be  explicitly  explored  in  order  to  understand  the  behavior  of  the  net¬ 
work.  In  this  section  we  shall  investigate  the  symmetries  that  exist  in  our  study  cases  and  derive 
the  regions  in  the  weight  spaces  that  need  to  be  investigated. 

» 

5.1.  Nature  of  Symmetries 

The  symmetries  exhibited  on  an  error  surface  of  a  network  can  be  characterized  as  those 
reflecting  the  symmetries  of  the  network  and  those  reflecting  the  symmetries  of  the  desired 
input/output  mapping.  To  see  the  effects  of  the  symmetries  in  a  network  on  its  associated  error 
surface,  let  us  examine  the  network  in  Fig  8.1.  The  first  symmetry  in  any  error  surface  associated 
with  network  8.1  occurs  because  the  signs  of  all  the  weights  connected  to  hidden  neuron  h\  can 
be  inverted  simultaneously  without  changing  the  overall  i/o  relation  of  the  network  [2],  This 
corresponds  to  the  operation  : 

opHl  :  — »  -wj,  1V4  — »  -vv4,  W5  —>  -w 5,  wj  — »  -w 5. 

When  we  apply  opHl  to  w,  a  point  in  the  weight  space  of  network  8.1,  opHlw,  the  image  of  w 
under  opHl,  will  have  the  same  error  as  w,  and  we  say  w  and  opHlw  are  equivalent  according  to 
this  type  of  symmetry.  Similarly,  changing  the  signs  of  the  weights  connected  to  neuron  h  2  also 
leaves  the  behavior  of  network  8.1  unchanged;  this  is  achieved  by: 

opH2  :  W3  —>  -w 3,  W7  — »  -W7,  wg  — » -wg,  W9  — »  -W9. 

Since  22  different  operations  can  be  derived  from  the  combinations  of  opHl  and  opH2,  four 
equivalent  configurations  result  from  this  type  of  symmetry.  This  tells  us  that  the  overall  input- 
output  relation  of  a  network  is  unchanged  if  the  signs  of  all  the  weights  connected  to  a  hidden 
neuron  in  that  network  are  simultaneously  inverted.  Therefore,  for  a  network  with  H  hidden  neu¬ 
rons,  2H  equivalent  configurations  result  from  this  type  of  symmetry. 

For  a  network  with  complete  connectivity  between  its  layers,  the  connection  patterns  of  all 
neurons  in  the  same  hidden  layer  will  be  identical  and  any  permutation  of  the  roles  of  the  hidden 
neurons  in  that  layer  will  lead  to  the  same  behavior  of  the  network  [2],  For  example,  for  a  hidden 
layer  of  H  neurons  where  the  roles  of  these  H  neurons  can  be  arranged  in  H !  different  ways,  the 
weight  space  associated  with  this  case  exhibits  H\  equivalent  configurations.  In  network  8.1, 
hidden  neurons  h  1  and  h  2  are  connected  identically  to  the  rest  of  the  system,  so  changing  the 
roles  of  these  two  neurons  does  not  affect  the  function  performed  by  the  network.  As  a  result,  the 
error  function  of  network  8.1  is  not  changed  when  we  make  the  following  four  substitutions: 

opHx  :  w2  i — >  w 3,  w4  < — >  W7,  W5  i — >  wg,  W6  < — »  w>9. 

Thus  two  equivalent  configurations  result  from  this  symmetry. 

When  a  network  with  complete  connectivity  between  the  input  layer  and  the  first  hidden 
layer  is  used  to  perform  the  /  -bit  parity  function,  we  can  arrange  /  input  neurons  in  / !  ways 
without  changing  the  behavior  of  the  network.  These  equivalent  arrangements  result  in  II 
equivalent  configurations  in  the  error  surface  with  respect  to  /  -bit  parity.  For  example,  when  we 
employ  network  8.1  to  represent  exclusive-OR  mapping,  the  symmetries  inherent  in  this  function 
will  be  reflected  in  the  error  surface.  One  symmetry  in  this  function  comes  from  the  fact  that 
input  signals  ( -1,  1  )  and  (  1,  -1  )  both  map  to  (  1  ).  Exchanging  the  identities  of  input  neurons  i  1 
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and  1 2  does  not  affect  the  global  characteristics  of  the  network  since  the  error  function  sums 
equally  over  all  input  patterns.  This  exchange  can  be  simulated  by  the  following  operation: 

oplx  :  W5  4 — »  w  6,  w  s  4 — »  W9. 

Since  there  are  2!  ways  to  arrange  two  input  neurons,  two  equivalent  configurations  arise  from 
this  type  of  symmetry. 

Another  symmetry  related  to  the  XOR  function  occurs  because  simultaneous  complementa¬ 
tions  of  the  two  inputs  do  not  change  the  definition  of  the  XOR  function.  On  the  other  hand, 
when  we  complement  only  one  input,  the  output  is  complemented.  The  inversion  of  the  signal  to 
unit  t'i  can  be  simulated  by  changing  the  signs  of  w 5  and  wg.  The  inversion  of  the  output  is 
achieved  by  changing  the  signs  of  w  1,  wj  and  w  3.  Therefore,  when  the  operation 

Opll  :  W5  4 - »  -W5,  Wj  4 — »  -Wj,  w !  4 — >  -wi,  H>2  4 — »  -W2,  W3  4 — ►  -W3 

is  carried  out,  the  behavior  of  network  5.2  is  left  unchanged.  In  the  same  way,  inversion  of  the 
input  signal  to  neuron  t2  and  of  the  output  signal  according  to  the  operation 

OpI2  :  W6  4 - »  -Wg,  W9  4 - >  -W9,Wi  4 - »  -W),  W2  < — >  -h,2»  w3  « — *  _vv3 

does  not  affect  the  behavior  of  network  8.1.  This  symmetry  results  in  22  equivalent 
configurations  in  the  error  surface  associated  with  the  XOR  mapping.  When  we  generalize  this 
result  to  the  case  of  the  /-bit  parity  function,  this  type  of  symmetry  produces  2l  equivalent 
configurations  in  the  associated  error  surface. 

In  summary,  for  a  fully  connected  layered  feed-forward  network,  the  number  of  equivalent 
configurations  produced  by  a  hidden  layer  of  H  neurons  is  2 H H !.  V/hen  such  a  network  is  used 
to  represent  the  /-bit  parity  function,  the  error  surface  associated  with  this  case  has  at  least 
21 1  \2H H !  equivalent  configurations.  For  example,  when  networic  8.1  is  employed  to  perform 
XOR  mapping,  the  associated  error  surface  will  exhibit  sixty-four1  fold  symmetry.  To  know 
these  symmetries  will  enable  us  to  define  the  fraction  of  the  weight  space  that  we  need  to  explore 
in  order  to  understand  the  behavior  of  such  a  networic. 

5.2.  Defining  a  Representative  Sector 

For  any  error  surface  and  the  weight  space  over  which  it  is  defined,  symmetries  in  the  error 
surface  imply  certain  equivalences  among  points  in  the  weight  space.  These  equivalences  can  be 
used  to  divide  the  weight  space  into  regions  such  that  the  error  surface  pieces  over  these  regions 
have  the  same  characteristics.  One  such  portion  of  the  error  surface  (  which  shows  all  charac¬ 
teristics  of  the  entire  error  surface  )  is  called  a  representative  part  of  the  error  surface ,  and  its 
corresponding  domain  is  called  a  representative  sector .  We  need  to  explore  only  a  representa¬ 
tive  sector  in  order  to  understand  the  whole  structure  of  the  error  surface. 

A  representative  sector  can  be  defined  by  specifying  its  boundaries.  A  logical  choice  for 
these  boundaries  is  a  subset  of  the  hyper-planes  left  invariant  by  inherent  symmetry  operations. 
For  example,  a  mirroring  operation,  vv;  — »  -w,-,  leads  to  a  hyper-plane  of  the  form  w,-  =  0. 
Exchanging  the  values  of  two  weights,  w,-  4 — »  wj,  implies  a  hyper-plane  of  the  form  w,-  -  w;-  = 
0.  Each  of  these  operations  reduces  the  amount  of  space  that  must  be  investigated  by  a  factor  of 
2;  that  is,  for  each  operation,  we  can  restrict  ourselves  to  examining  only  the  region  to  one  side  of 


1  22  x  2!  x  22  x  2! 
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the  hyper-plane  implied  by  the  operation.  When  more  than  one  operations  are  involved,  hyper¬ 
plane  boundaries  must  be  properly  defined  so  that  the  application  of  all  the  involved  operations  to 
the  chosen  region  can  produce  the  entire  weight  space. 

In  selecting  hyper-plane  boundaries  associated  with  a  given  set  of  symmetry  operations,  we 
assign  the  given  operations  different  priorities.  Any  sector  partitioned  by  hyper-planes  associated 
with  operations  of  higher  priorities  are  closed  under  operations  of  lower  priorities.2  For  example, 
if  opA,  opB  are  assigned  with  priorities  higher  than  that  of  opC,  then  hyper-planes  implied  by 
opA  and  opB  can  be  defined  in  such  a  way  that  any  sector  bounded  by  these  two  hyper-planes  is 
closed  under  opC.  Once  the  priorities  are  determined,  the  operations  of  the  first  priority  are  used 
to  define  hyper-planes  to  bound  a  proper  region,  referred  to  as  the  first  -level  representative  sec¬ 
tor  ,  in  an  associated  weight  space.  This  first-level  sector  must  be  defined  in  such  a  way  that, 
when  we  apply  all  combinations  of  operations  with  highest  priority  to  this  region,  the  entire 
weight  space  must  be  generated.  We  then  subdivide  the  first  level  sector  by  the  operations  of  the 
second  priority.  The  subdivision  recursively  proceeds  in  this  manner  until  all  the  operations  have 
been  considered  and  the  final  representative  sector  has  been  found  at  the  lowest  sector  level. 

Applying  the  above  method  to  the  case  of  network  8.1  performing  the  XOR  mapping,  we 
assign  the  first  priority  to  opHl,  opH2  and  opHx,  the  operations  associated  with  the  symmetries 
in  the  network,  and  the  second  priority  to  opll,  opI2  and  oplx,  the  operations  associated  with  the 
symmetry  in  the  XOR  mapping.  The  reason  for  this  assignment  is  that  opll,  opI2  and  oplx  do 
not  affect  w4  (  the  threshold  of  neuron  h  i )  and  w7  ( the  threshold  of  neuron  h2 ).  That  is,  w  and 
its  image  under  any  combination  of  these  three  operations  have  the  same  w4  and  w7  components. 
If  we  define  the  hyper-plane  associated  with  opHl,  opH2  and  opHx  in  terms  of  only  w4  and  w7, 
any  partitions  defined  by  these  hyper-planes  will  be  closed  under  any  combination  of  opll,  opI2 
and  oplx.  We  then  further  subdivide  the  first  level  sector  by  opll,  opI2  and  oplx.  The  second 
level  sector  derived  from  this  partition  can  be  used  as  a  representative  sector.  This  second  level 
sector  is  a  valid  representative  sector,  because  when  we  apply  all  eight  distinct  combinations  of 
opll,  opI2  and  oplx  to  this  region,  we  produce  the  first  level  sector.  Then,  by  applying  the  eight 
combinations  of  opHl,  opH2  and  opHx  to  the  first  level  sector,  we  generate  the  entire  weight 
space. 

To  locate  the  first-level  sector,  we  first  employ  opHl  and  opH2  to  divide  the  weight  space 
into  four  equivalent  sectors.  Because  the  sets  of  the  weights  mirrored  by  each  of  these  two  opera¬ 
tions  are  disjoint,  we  can  define  hyper-planes  opHl  and  opH2  independently  of  each  other.  The 
mirroring  of  w;  implies  a  hyper-plane  of  the  form  w,-  =  0.  OpHl,  a  composite  operation  that  mir¬ 
rors  wi,  W4.W5  and  wg,  implies  a  hyper-plane  of  the  form 

aw2  +  bwn  +  0V5  +  dwf,  =  0, 

where  a,  b,  c  and  d  are  arbitrary  numbers.  Similarly,  opH2,  a  composite  operation  that  mirrors 
W3,  w 7,  wg  and  W9,  implies  a  hyper-plane  of  the  form 

ewj  +/VV7  +  gwg  +  kwy  =  0, 

where  e,  f,  g  and  h  are  arbitrary  numbers.  By  restricting  ourselves  to  w  4  and  w7,  hyper-planes 


2  For  function  /  :S  -*S  and  subset  T  in  S  ,  /  is  closed  under  T  if  for  every  x  in  T ,  /  (x )  is  also  in  T . 
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opHl  and  opH2  are  reduced  to  w4  =  0  and  w7  =  0.  Examining  each  region  bounded  by  these  two 
hyper-planes,  we  find  that  the  region  that  satisfies 

w4  >  0  and  w7  >  0  (15) 

is  closed  under  opHx.  Since  it  is  closed,  we  choose  this  region,  implicitly  defining  the  hyper¬ 
plane  for  opHx  to  be  w4  =  w7.  Considering  these  three  hyper-planes,  we  define  the  first-level 

sector  as  a  region  satisfying 

w4  h  w-j  h  0.  (15) 

This  is  a  valid  first-level  sector,  because  when  we  apply  all  eight  combinations  of  opHl,  opH2 
and  opHx  to  this  region,  the  entire  weight  space  is  generated. 

To  find  second  level  sectors,  we  note  that  the  relationship  of  opll,  opI2  and  oplx  to  one 
another  are  similar  to  that  of  opHl,  opH2  and  opH3.  Hyper-planes  for  opll,  opI2  and  oplx  can 
thus  be  chosen  asw5  =  0,  =  0  and  w  $  —  respectively.  The  second  level  sector  which  is  also 

a  representative  sector  for  a  whole  problem  can  be  expressed  as  a  region  satisfying 

w4  >  w»7  >  0  and  W5  >  w6  >  0.  (17) 

For  a  case  of  a  network  with  H  hidden  neurons,  inequality  (16)  can  be  generalized  to 

Wh\  h  M>hJ  ^  "•  ^  Whi  h  •••  >  ww  h  0,  (18) 

where  whi  is  the  threshold  of  hidden  neuron  hi.  Two  factors  yield  these  inequalities.  The  first  is 
that  the  mirroring  of  wy  leads  to  a  hyper-plane  Whi  =  0.  The  second  is  that  one  permutation  of 
the  threshold  values  can  be  selected  from  the  rest  by  imposing  the  constraint: 

Wh  i  >  Wh 2  5  -  >  Whi  -  •"  -  WhH.  (19) 

Similarly,  a  sector  related  to  the  operations  associated  with  the  I-input  parity  function  can  be 
defined  as 

wj i  >  Wj2  S  >  Wji  >  •••  h  Wji  >  0,  (20) 

where  w;i-  is  the  weight  connecting  input  neuron  i  to  hidden  neuron  j .  Finally,  by  combining 
constraints  (19)  and  (20),  a  representative  sector  of  a  three-layered  network  with  complete  con¬ 
nectivity  performing  the  I-input  parity  function  can  be  defined. 
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6.  THE  ERROR  SURFACE  OF  NETWORK  8.1  PERFORMING  THE  XOR  MAPPING 


6.1.  Region  around  the  Origin 

In  the  error  function  of  network  8.1,  with  respect  to  the  XOR  mapping,  the  origin  is  the 
center  of  symmetry.  Using  TsqErr(  w  )  as  in  Equation  12,  the  value  at  the  origin  is  four,  and  the 
first  derivatives  at  the  origin  are  zero  in  all  directions.  The  error  surface  at  the  origin  has  a  posi¬ 
tive  curvature  for  the  axis  corresponding  to  w  \  and  zero  curvature  on  all  other  axes.  This  is  easily 
understood  by  looking  at  the  network  itself.  If  all  the  weights  are  zero,  the  output  of  the  network 
is  zero  for  every  input  pattern  and  each  individual  error  is  thus  one.  If  we  allow  only  w  i  to  be 
nonzero,  the  output  of  the  network  is 


1  -  exp(  — w  i 
1  +  exp  ( -w  i ) 


for  any  \p .  Near  the  origin,  this  equation  can  be  approximated  as 

0(ip )  = 

Then,  the  error  near  the  origin  and  along  w  i  axis  can  be  approximated  by: 

4  +  w 

This  result  tells  us  that  the  origin  is  a  local  minimum  along  the  direction  w\.  Changing  any  sin¬ 
gle  weight  other  than  w  {  will  not  result  in  a  change  of  error,  thus  producing  a  surface  that  is 
totally  flat  along  all  other  axes. 

Analyzing  the  pattern  of  connections  in  network  8.1,  we  find  that  at  least  four  nonzero 
weights  must  be  involved  in  defining  a  direction  where  the  error  is  reduced  when  we  move  away 
from  the  origin.  A  sample  direction  is 

(  wj,  w2,  W3,  w4,  VV5,  wg,  W7,  wj,  W9  )  =  (  0,  1,  0,-1,  1,-1,  0,  0,  0),  (20) 

and  the  error  along  this  direction  can  be  approximated  by 

4.00  -  0.25e4, 


where  e  is  the  distance  from  the  origin  along  the  direction  defined  by  (20).  Thus  the  origin  of  the 
error  surface  of  network  8.1  in  representing  the  XOR  function  is  a  multidimensional  saddle  point. 
Close  inspection  shows  that  it  is  a  plateau  in  which  a  few  very  shallow  valleys  start  radially  from 
the  origin;  in  all  other  directions  the  surface  turns  upwards. 


6.2.  Minima  in  the  Error  Surface 

To  obtain  a  survey  of  the  minima  in  the  error  surface  of  network  8.1  in  performing  the  XOR 
mapping,  we  employeded  a  search  from  one  thousand  points  randomly  distributed  in  the 
representative  sector.  We  limited  the  value  of  each  weight  to  [  -3,  3  ],  reducing  the  associated 
weight  space  to  a  compact  set,  and  recorded  the  local  minima  found  in  this  9-dimensional  cube. 
All  minima  fall  on  the  surface  of  this  cube,  i.e.  the  boundary  of  the  weight  domain.  For  each 
minimum,  we  also  recorded  the  percent  of  the  steepest  descents  approaching  that  point.  This  sta¬ 
tistical  figure  reflects  the  fraction  of  the  weight  space  covered  by  the  collection  zone  of  that 
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minimum. 

The  minima  found  can  be  separated  into  three  types;  the  minima  that  lie  in  the  representa¬ 
tive  sector  defined  by  Inequalities  (17)  are  used  as  representatives  of  their  types  and  are  listed  in 
Table  1.  Because  of  the  64-fold  symmetry  inherent  in  this  error  surface,  each  such  minimum  will 
be  replicated  sixty-four  times;  however,  some  of  these  replicas  may  fall  on  top  of  one  another. 
For  example,  minimum  MA  has  a  multiplicity  of  four  and  thus  there  are  a  total  of  only  sixteen 
such  locations  in  the  whole  weight  space. 

The  minima  of  type  MA  and  MB  are  local  minima  of  distinct  points  while  the  minima  of 
type  MC  are  not.  Each  member  in  type  MC  is  a  connected  set  that  forms  a  groove  of  constant 
error;  the  same  minimal  error  is  obtained  for  all  points  where  the  sum  of  two  weights  w2  and  vv3 
is  equal  to  1.62.  These  points  satisfy  the  conditions 


w5  =  W  s, 

and  w 6  =  W9, 

(21) 

W2  +  W3  = 

1.62. 

(22) 

Condition  (21)  implies  that  neurons  h\  and  /t2  have  identical  roles  and  thus  their  outputs  are  the 
same  for  every  input.  Denoting  these  two  equal  outputs  by  s ,  the  output  of  network  8.1  becomes 
swf  ( (w2  +  wj)s  ),  where  swf  0  is  the  sigmoid  function.  According  to  condition  (22),  for  every 
point  in  the  groove,  the  output  of  network  becomes  swf  ( 1.62s  ). 

Examining  the  minima  in  Table  1,  we  see  that  all  of  them  lie  at  the  surface  of  the  9- 
dimensional  cube  of  half-size  3.  They  are  not  the  real  minima  on  the  whole  error  surface;  they 
are  simply  the  lowest  points  in  valleys  that  radiate  out  from  the  center  of  the  surface  and  get  trun¬ 
cated  by  the  finite  size  of  the  domain.  To  find  the  real  minima,  we  can  selectively  trace  these  val¬ 
leys  to  larger  distances  from  the  origin. 

Based  on  the  capability  of  a  minimum  in  representing  a  solution  to  the  XOR  task,  we  clas¬ 
sify  these  minima  into  two  classes:  good  and  bad  minima.  Good  minima  produce  the  desired 
output  value  of  the  XOR  function  for  all  four  possible  input  patterns  within  some  small  error  mar¬ 
gin;  while  bad  minima  produce  an  output  that  will  show  an  unacceptably  large  error  for  one  or 
more  input  patterns.  With  this  classification,  MA  can  be  considered  a  good  minimum,  while  MB 
and  MC  are  bad  minima. 
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Table  1 

Minima  in  Error  Surface  of  Network  8.1 
Performing  the  XOR  Mapping 


MA 

MB 

AC 

I 

B 

0.0803 

2.0969 

8.95 

7.42 
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6.3.  Characteristics  of  Valleys 

Since  the  valleys  ran  roughly  radially  away  from  the  origin,  we  characterize  them  by  study¬ 
ing  their  error  value  profiles  on  subsequent  spheres  of  larger  and  larger  radii  around  the  origin. 
On  each  9-dimensional  sphere,  we  determine  a  local  minimum  by  running  a  steepest  descent 
from  a  starting  point  that  is  given  by  the  projection  of  the  local  minimum  on  the  previous  sphere. 
Finding  the  local  minimum  has  been  implemented  as  follows: 

Assume  that  at  iteration  n ,  the  process  is  at  point  x„.  To  find  xrt+i,  we  first  compute  the 
gradient  of  the  error  ftmction  at  x„ .  This  gradient  is  projected  along  the  radius  direction  onto  the 
8-dimensional  hyper-plane  which  is  tangent  to  the  sphere  at  x„ .  Let  us  denote  the  projected  gra¬ 
dient  by  ( VErr  )p .  According  to  this  projected  gradient,  the  process  moves  downhill  to  a  point  ( 
x  +  h(  VErr  )p  ),  where  h  is  a  step-size.  Projecting  the  point  (  x  +  h (  VErr  )p  )  along  its  radius 
direction  back  onto  the  sphere  locates  point  xn+i.  This  iterative  process  stops  when  (  VErr  )p  is 
smaller  than  a  given  threshold.  At  our  stopping  point,  we  define  an  8-dimensional  hyper-plane 
tangent  to  the  sphere  and  then  compute  the  principal  curvatures  of  the  error  function  on  this 
hyper-plane.  If  all  eight  curvatures  are  positive,  the  stopping  point  is  a  local  minimum  on  the 
sphere  and  the  process  ends.  In  all  other  cases,  we  choose  a  new  point  in  the  vicinity  of  the  stop¬ 
ping  point  and  lying  on  the  axis  of  strongest  negative  curvature.  From  this  point,  we  again  run 
the  steepest  descent  to  locate  a  new  stopping  point.  We  continue  this  process  until  we  find  a  final 
local  minimum. 

We  now  project  this  minimum  point  along  its  radius  direction  onto  a  new  sphere  of  larger 
radius  and  locate  a  minimum  on  the  new  sphere  with  the  above  process.  Generally,  this  win 
quickly  converge  to  a  corresponding  minimum  on  the  new  sphere,  since  the  minima  on  the  two 
spheres  are  near  each  other.  The  locus  of  aU  the  minima  on  subsequent  spheres  defines  the  floor 
of  a  valley. 

To  explore  the  structure  of  the  vaUey  associated  with  minimum  MA,  we  run  the  above  pro¬ 
cess  from  minimum  MA  tracing  out  the  valley’s  floor  backwards  toward  the  origin.  This  floor  is 
then  used  to  represent  the  path  of  the  valley.  In  Figure  9  we  portray  the  path  of  this  vaUey  by 
plotting  the  coordinates  w;  ,...,  W9  of  the  local  minima  as  a  function  of  the  radius  of  the 
corresponding  sphere.  Near  the  origin,  the  valley  splits  into  two  branches  which  then  merge 
again  at  a  distance  of  about  2  from  the  origin.  We  display  the  profile  of  the  valley  by  plotting 
the  error  value  along  the  floor  of  the  valley  as  a  function  of  the  radius  (  graph  AC  in  Figure  10  ). 
This  graph  illustrates  that  the  floor  of  the  valley  approaches  0  if  we  follow  it  far  enough  from  the 
origin.  It  is  thus  a  good  valley  since  it  leads  to  a  proper  solution  for  the  XOR  problem. 

We  also  apply  this  method  to  examine  the  valley  associated  with  minimum  MB.  We  find 
that  when  the  radius  of  the  sphere  is  less  than  3.3,  there  is  only  one  type  of  minimum  located  on  a 
sphere,  the  one  that  belongs  to  the  valley  associated  with  minimum  MA.  We  therefore  say  that 
the  valley  associated  with  minimum  MB  starts  only  at  a  radius  of  about  3.3.  Graph  B  in  Figure 
10  portrays  the  profile  of  this  valley.  Since  the  floor  of  this  valley  never  falls  below  an  error 
value  of  2.0,  it  represents  a  bad  valley  that  never  yields  a  solution. 

For  the  valley  associated  with  minimum  MC,  we  find  that  the  local  minimum  located  by  a 
steepest  descent  on  each  sphere  is  the  same  as  that  of  the  valley  associated  with  minimum  MA. 
We  therefore  conclude  that  minima  MA  and  MC  lie  in  the  same  class  of  valley.  The  reason  that 
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two  separate  minima,  MA  and  MC,  can  be  associated  with  one  valley  can  be  understood  by  look¬ 
ing  at  Figure  11.  This  figure  shows  the  contour  plot  of  a  scalar-valued  function  defined  over 
two-dimensional  space.  When  the  domain  of  this  function  is  restricted  to  the  region  inside  the 
square  box,  points  Ml,  M2  and  M3  become  local  minima  of  the  restricted  function.  If,  however, 
wemna  steepest  descent  from  the  minima  Ml,  M2  and  M3  while  limiting  the  search  space  to  the 
circle  shown,  then  these  th'ce  steepest  descents  all  stop  at  the  same  point  on  the  circle  labeled 
Min.  This  figure  is  an  oversimplification,  but  it  provides  a  model  for  the  real  situation  of  minima 
MA  and  MC. 


Radius 


Radius 


Fig.  9  The  Locus  of  the  Floor  of  the  Valley  Associated 
with  Minimum  MA  in  Table  1. 


Error 


Radius 


Fig.  10  The  Profiles  of  Valleys  on  the  Error  Surface  of 
Network  8.1  Performing  the  XOR  Mapping. 


To  obtain  some  information  about  the  shape  of  the  valleys,  we  studied  their  cross-section 
on  the  9-dimensional  spheres  around  the  origin.  We  first  located  the  floor  of  a  valley  on  a 
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particular  sphere.  Next,  we  defined  an  8-dimensional  hyper-plane  tangent  to  the  sphere  at  the 
intersection  point.  By  restricting  the  domain  of  the  error  function  to  this  hyper-plane,  we  derived 
eight  principal  curvature  axes  from  the  restricted  function.  Figure  12  shows  the  error  values 
along  an  arc  produced  by  the  intersection  of  the  plane  through  the  principal  curvature  axis  and  the 
origin;  the  lateral  deviation  is  measured  in  degree  from  the  floor  of  the  valley  along  the  arc. 


Fig.  1 1  A  Conceptual  Model  Illustrating  that  Limitation  of 
All  Weights  to  Fixed  Intervals  Can  Lead  to  Multiple  Minima. 


Next,  we  have  tried  to  obtain  some  measure  for  the  cross-section  of  a  valley  as  a  function  of 
R,  the  distance  from  the  origin.  For  each  error  value  profile  on  a  particular  sphere  (Fig.  12),  we 
locate  the  interval  (  Qupper ,  Slower  )  where  the  error  is  less  than  four,  i.e.,  the  error  value  at  the  ori¬ 
gin.  We  use  the  smaller  value  of  these  two  bounds  for  each  axis  and  call  it  the  half -width. 
Using  all  the  eight  half-widths,  we  calculate  the  area  of  an  8-dimensional  elliptical  patch  provid¬ 
ing  some  measure  of  a  cross-section  of  a  valley. 

4  g 

Area  of  elliptical  patch  =  8(^f|0; ), 

where  R  is  a  radius  of  the  sphere  and  9;  (  measured  in  radians  )  is  the  half-width  associated  with 

axis  i .  The  cross-sectional  areas  of  these  valleys  as  functions  of  a  sphere’s  radius  are  displayed  in 

Figure  13.  Our  measure  for  the  cross-section  of  valley  AC  displays  a  discontinuity  at  a  radius  of 

about  4.8.  This  can  be  understood  by  comparing  the  structure  of  the  cross-section  at  radius  4.7 

(Fig.  12a)  with  that  at  radius  4.9  (Fig.  12b).  Around  a  radius  of  4.8,  the  shoulders  of  one  of  the 

profiles  (  shown  as  a  solid  line  )  rise  above  4.0  and  thus  lead  to  a  drastic  reduction  in  the  cross- 

s 

section.  Figure  14  shows  the  graph  of  the  product  of  all  the  eight  angular  half-widths  ( JfJ0;  )  as 

The  volume  of  an  n-dimensionai  sphere  with  radius  R  is  equal  to  ^ —  where  I~(-y)  is  defined  to  be 

«r(y) 

v n(/i ill  when  n  is  odd,  and  (  -)!  when  n  is  even. 
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a  function  of  a  sphere’s  radius,  thus  providing  a  measure  of  the  fraction  of  the  area  of  the  total 
sphere  that  has  an  error  value  less  than  4.0.  This  view  which  is  more  sensitive  in  the  region 
around  the  origin  shows  a  second  discontinuity  in  valley  AC  at  a  radius  of  about  2.0.  This  occurs 
because  two  valleys  radiating  from  the  origin  merge  into  a  single  valley  at  this  point  (see  Fig.  9). 


Error 
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(b) 


Fig.  12  Cross-Section  of  the  Valley  Associated  with  Minimum  MA  : 

(a)  on  the  Sphere  with  Radius  4.7, 

(b)  on  the  Sphere  with  Radius  4.9. 
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7.  ERROR  SURFACES  OF  OTHER  NETWORKS  PERFORMING  THE  XOR  MAPPING 


7.1.  The  Error  Surfaces  of  Networks  8.2  and  8 3  Performing  the  XOR  Mapping 

To  gain  insight  into  changes  in  the  structures  of  an  error  surface  as  the  number  of  hidden 
neurons  in  a  network  increases,  we  study  two  other  networks  implementing  the  XOR  function; 
one  has  three  hidden  neurons  (Fig.  8.2),  and  the  other  has  four  (Fig.  8.3).  Network  8.2  differs 
from  network  8.1  by  the  additional  hidden  neuron  h  3  and  its  associated  weights  vv4,  w  n,  wn  and 
w13.  Similarly,  network  8.3  has  an  additional  hidden  neuron  h4  and  weights  w5,  w15,  w16  and 
w; 7.  To  investigate  the  error  surfaces  of  networks  8.2  and  8.3,  we  have  employed  the  approaches 
discussed  above  for  network  8.1. 

The  characteristics  of  the  region  around  the  origins  of  the  error  surfaces  of  networks  8.2  and 
8.3  are  similar  to  those  of  network  8.1.  The  errors  at  the  origins  of  the  error  surfaces  of  network 
8.2  and  8.3  are  4.0,  and  the  first  derivatives  at  their  origins  are  zero  in  every  direction.  We  also 
find  that  both  surfaces  have  positive  curvatures  along  the  w  i-axis,  and  that  near  the  origin  the 
error  along  this  axis  can  be  approximated  by 

4  +  w}. 


To  confirm  that  the  two  functions  again  form  saddle  points,  we  look  back  to  the  case  of  net¬ 
work  8.1.  If  the  weight  space  of  network  8.2  is  limited  to  Ws.i,  i.e.,  the  subspace  spanned  by  wi, 
wo,  W3,  W5,  vvg,  W7,  wg,  wg,  and  wio»  the  error  surface  of  network  8.2  will  be  identical  (except 
for  the  labeling  of  the  coordinates)  to  the  error  surface  of  network  8.1.  This  likeness  tells  us  that 
the  error  surface  of  network  8.1  is  a  subset  of  the  error  surface  of  network  8.2  when  both  net¬ 
works  represent  the  same  function.  In  section  6.1,  we  have  shown  that  the  error  surface  of  net¬ 
work  8.1  has  a  direction  where  the  error  is  reduced  when  we  move  away  from  the  origin. 
Because  the  error  surface  of  network  8. 1  is  a  subset  of  the  error  surface  of  network  8.2,  one  will 
find  the  same  behavior  along  this  direction  in  the  error  surface  of  network  8.2.  We  can  therefore 
conclude  that  the  origin  of  the  error  surface  of  network  8.2  is  also  a  multi-dimensional  saddle 
point.  By  the  same  process,  one  can  also  conclude  that  the  origin  of  the  error  surface  of  network 
8.3  is  a  multi-dimensional  saddle  point. 

By  probing  the  weight  space  with  random  starting  points,  followed  by  gradient  descent,  we 
find  four  types  of  minima,  denoted  by  MD,  ME,  MF  and  MG,  in  the  error  surface  of  network  8.2, 
and  eight  types  of  minima,  denoted  by  MH,  MI,  MJ,  MK,  ML,  MM,  MN  and  MO,  in  the  error 
surface  of  network  8.3.  The  representatives  of  these  minima  for  networks  8.2  and  8.3,  which  are 
points  in  the  representative  sectors,  are  listed  in  Tables  2  and  3,  respectively. 

In  network  8.2,  the  minima  of  type  MD  are  distinct  points  while  those  of  types  ME,  MF  and 
MG  are  sets  of  points  with  the  constant  error  values.  ME  and  MG  are  again  one  dimensional 
grooves,  while  MF  is  a  two  dimensional  surface.  One  interesting  point  about  minima  MF  is  that 
every  point  in  tins  type  of  minima  satisfies 


w4(  1  -expjw  u)) 
1  1  +  exp(wn) 


-2.86. 


The  expression  on  the  left  hand  side  of  this  equation  is  simply  the  threshold  of  the  output  neuron 
which  is  modified  by  adding  to  weight  w  1  a  constant  of 
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w4(  1  -exp(wn)) 

1  +  ex p(wu) 

This  constant  conies  from  the  fact  that  w  12  <tnd  w  13  of  every  point  in  minima  MF  are  zero.  As  a 
result,  the  signal  contributed  by  hidden  neuron  /13  to  the  output  neuron  is  a  constant,  thus  acting 
like  an  additional  threshold  of  the  output  neuron. 

Comparing  the  data  in  Tables  1  and  2,  we  find  that  the  minima  MF  contain  the  minimum 
point  MA  If  we  choose  the  point  in  minima  MF  where  the  weights  w4,  w  n,  w  12,  and  w13  that  are 
absent  in  network  8.1  are  all  zero,  becomes  -2.86,  and  all  the  corresponding  weights  of  the 
two  networks  are  the  same.  We  also  find  that  minima  MG  in  the  case  of  network  8.2  is  a 
transformed  version  of  minima  MB.  If  we  allow  w2  of  network  8.1  to  be  larger  than  d.O  while 
still  restricting  the  other  weights  in  the  range  of  [-3,  3],  minimum  MB  will  move  to  the  point 

(  W  1,  W2,  W3,  W4,  W5,  Wg,  W»7,  Wj,  W9) 

=  (-0.10,  3.10,-3.00,  3.00,  3.00,  1.04,  3.00,  3.00,-1.03). 

This  new  minimum  is  equivalent  to  the  point  in  minima  MG  where  w2  =  3.10  and  w4  =  0.0. 
Hence,  minimum  MB  is  transformed  into  a  minimum  of  class  MG  when  we  add  an  extra  hidden 
neuron  to  network  8.1,  and  it  remains  a  projection  of  MG  onto  the  subdomain  of  network  8.1. 

In  network  8.3,  minima  of  types  MH,  MJ,  MK  and  MN  are  distinct  points,  while  those  of 
types  MI,  ML  can  be  viewed  as  grooves  of  constant  error,  and  minima  MM  and  MO  are  planes  of 
constant  error.  Comparing  the  data  in  Table  2  and  3,  we  find  that  minima  ML,  MM,  MO  contain 
minima  MD,  ME,  and  MG,  respectively. 

The  characteristics  of  the  valleys  on  the  error  surface  of  network  8.2  are  illustrated  in  Fig¬ 
ures.  15,  16,  and  17.  There  are  only  two  types  of  valleys  on  this  error  surface  and  both  of  them 
are  good  valleys.  MD,  ME,  and  MG  are  all  associated  with  the  same  valley  (  valley  DEG  in  Fig. 
15).  The  cross-sectional  areas  of  these  two  valleys  are  displayed  in  Figure  16,  and  the  product  of 

13 

all  the  thirteen  angular  half- widths  (  )  is  shown  in  Figure  17.  The  cross-section  of  the  main 

valley  shows  several  discontinuities.  The  ones  at  radius  2.0  and  4.8  are  the  ones  already  known 
from  network  8.1  (Fig.  14).  New  discontinuities  at  radii  of  about  7  and  about  13  are  caused  by 
the  merging  of  two  valleys  into  one  and  by  an  abrupt  change  of  a  half-width,  respectively. 

The  characteristics  of  the  valleys  on  the  error  surface  of  network  8.3  are  displayed  in  Fig¬ 
ures  18,  19,  and  20.  One  point  about  valley  Ml  deserves  mentioning:  no  minimum  lies  inside 
valley  Ml,  because  valley  Ml  merges  into  valley  HLO  at  a  radius  of  about  7.5.  Thus,  a  steepest 
descent  that  runs  into  valley  Ml  will  eventually  slide  into  valley  HLO,  and  stop  at  a  minimum  in 
valley  HLO. 
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Table  2 

Minima  in  the  Error  Surface  of  Network  8.2 
Performing  the  XOR  Mapping 


Identifier 


MD  ME  MF 


Radius  from  the  Origin 


Multiplicity 


Percent  of  the  Weight  Space 
Covered  by  the  Collection  Zone 
of  the  Minimum 


Classification 


Percent  of  the  Weight  Space 
Covered  by  the  Collection  Zone 
of  the  class 


F 

DEG 

8.93  * 

8.42  * 

0.0803 

2.0965 

25.3 

1.5 

-3.00 

C 

A 

3.00 

B 

3.00 

3.00 

D 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

-3.00 

3.00 

-3.00 

2.92 

E 

-3.00 

0.00 

-3.00 

A+B  =3.42 

C  -t-  D  (  iigjgiR) 
'  1  +  exp (— £ ) 

F  +  G  =  3.10 

*  minimum  radius 

+  new  over  network  8.1 


)  =  -2.86 
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Table  3 

Minima  in  the  Error  Surface  of  Network  8 3  Performing  the  XOR  Mappii 

Identifier 

MH 

MI 

MJ 

MK 

ML 

MM 

MN 

MO 

Valley 

HLO 

IK 

J 

IK 

HLO 

M 

N 

HLO 

Error 

0.0005 

0.0213 

0.0236 

0.0356 

0.0605 

0.0665 

2.0808 

2.0965 

Radius  from  the  Origin 

12.00 

11.09* 

11.22 

11.40 

10.96* 

11.38* 

10.45 

9.41* 

Multiplicity 

8 

4 

16 

2 

4 

12 

8 

6 

Percent  of  the  Weight  Space 

■M 

HI 

0.5 

Covered  by  the  Collection 

48.6 

1 

38.6 

1.4 

0.5 

Zone  of  the  Minimum 

iMi 

1  1 

Wi 

0.00 

-3.00 

-3.00 

-2.97 

-2.20 

-3.00 

0.00 

-0.10 

W  2 

3.00 

A 

3.00 

3.00 

3.00 

E 

3.00 

H 

3.00 

B 

3.00 

3.00 

3.00 

F 

3.00 

I 

w* 

-3.00 

3.00 

3.00 

3.00 

C 

G+ 

-3.00 

J+ 

W  5 

-3.00 

-3.00 

3.00 

-3.00 

D+ 

3.00 

-3.00 

-3.00 

Wj 

3.00 

3.00 

1.49 

3.00 

3.00 

3.00 

3.00 

3.00 

w7 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

3.00 

w8 

3.00 

3.00 

3.00 

1.18 

2.70 

3.00 

1.04 

w9 

3.00 

3.00 

1.49 

3.00 

3.00 

3.00 

3.00 

3.00 

w10 

-3.00 

3.00 

3.00 

-3.00 

-2.70 

3.00 

3.00 

3.00 

wu 

-3.00 

3.00 

3.00 

-3.00 

-3.00 

3.00 

1.04 

Wl2 

3.00 

2.82 

1.49 

3.00 

3.00 

3.00+ 

3.00 

3.00+ 

w13 

3.00 

-3.00 

-3.00 

3.00 

3.00 

3.00+ 

3.00 

3.00+ 

VV14 

-3.00 

-3.00 

-3.00 

1.18 

-3.00 

3.00+ 

-0.54 

1.04+ 

wl3 

3.00 

0.68 

1.49 

3.00 

3.00+ 

2.92 

3.00 

3.00 

W  16 

-3.00 

0.84 

-3.00 

3.00 

3.00+ 

-3.00 

3.00 

3.00 

w17 

3.00 

0.84 

-3.00 

-1.17 

-3.00+ 

-3.00 

-0.54 

-1.03 

Classification 

good 

bad 

Percent  of  the  Weight  Space 
Covered  by  the  Collection 
Zone  of  the  class 

99.3 

0.7 

A  +  B  =  5.32;  C  +  D  =  -1.01;  E  +  F  +  G  =  3.42;  H+I  +  J  =  3.10 
*  minimum  radius 

+  new  over  network  8.2.  Note  that  the  added  weights  may  be  different  from  w5,  w13,  w]6  and  w17. 

This  occurs  because  the  points  listed  in  this  table  are  the  points  in  the  representative  sector. 

Weight  w5,  w  13l  w  16  and  w  17  are  mapped  to  other  weights  when  it  is  mapped  into  the  representative  sector. 

-37- 


Area  (log  scale) 


M 


Area  Gog  scale) 


Radius 


Radius 


Fig.  19  The  Cross-sectional  Areas  of  the  Valleys 
on  the  Error  Surface  of  Network  8.3  Performing  the  XOR  Mapping. 
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8.  ERROR  SURFACE  OF  NETWORK  8.4  PERFORMING  3-INPUT  PARITY  MAPPING 

Investigating  the  error  surface  of  network  8.4,  computing  the  three-input  parity  function,  we 
find  that  the  error  at  the  origin  is  8  and  the  first  derivatives  at  the  origin  are  zero  in  every  direc¬ 
tion.  The  curvatures  of  the  surface  at  the  origin  are  zero  except  for  the  curvature  along  axis  wi. 
The  error  near  the  origin  along  axis  w  i  can  be  approximated  by 

8  +  2w  ?. 


At  the  origin  we  find  that  along  the  direction 

(  w  1,  W2,  VV3,  VV4,  VV5,  W6,  H>7,  WS,W9,  WlO,  W  11,  W  12,  w  13,  W  14,  W  15,  W  16  ) 

=  (  0,  1,  0,  0,  -1,  1,  1,-1,  0,  0,  0,  0,  0,  0,  0,  0),  (23) 


the  error  can  be  approximated  by 

8  -  10e4, 

where  e  is  a  distance  from  the  origin  along  the  direction  defined  by  (23).  Because  there  is  a 
down-hill  path  at  the  origin,  the  error  surface  has  a  saddle  point  at  the  origin. 

Employing  the  search  method  used  in  the  case  of  network  8.1,  we  found  nine  types  of 
minima.  The  representatives  of  these  nine  types  are  listed  in  Table  4.  This  table  tells  us  that 
minima  of  types  MP,  MQ,  MR,  MV  and  MW  are  distinct  points,  while  those  of  types  MS,  MT 
and  MU  are  grooves  of  constant  error  values.  On  this  error  surface,  we  found  three  valleys,  the 
characteristics  of  which  are  illustrated  in  Figures  21,  22  and  23.  Figure  21  indicates  that  of  the 
three  valleys,  two  are  good,  since  their  floors  approach  zero  when  one  proceeds  far  enough  from 
the  origin.  The  good  valley  (P-X),  associated  with  minima  MP,  MQ,  MS,  MT,  MU,  MV  and 
MX,  starts  at  the  origin.  The  second  good  valley  (R),  associated  with  minimum  MR,  starts  at  a 
distance  of  about  7.0  from  the  origin.  The  valley  (W),  associated  with  minimum  MW,  starts  at  a 
distance  of  about  6  from  the  origin.  Its  floor  is  not  close  to  zero;  it  is  therefore  a  bad  valley. 
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Table  4 

Minima  in  Error  Surface  of  Network  8.4 
Performing  3-input  Parity  Mapping 


Identifier 

MP 

MQ 

MR 

MS 

MV 

MW 

MX 

Valley 

P-X 

P-X 

R 

B 

P-X 

P-X 

B 

w 

P-X 

Error 

0.1551 

B 

3.3473 

3.4420 

5.9983 

6.0635 

7.1144 

Radius 

from 

9.63 

9.49 

9.20 

8.24* 

8.42* 

8.92 

8.27 

7.53 

the  Origin 

Multiplicity 

■ 

8 

12 

12 

24 

8 

4 

6 

2 

Percent  of  the  Weight  Space 

Covered  by  the  Collection 

85.2 

8.7 

0.3 

1.3 

2.3 

0.2 

0.3 

0.2 

Zone  of  the  Minimum 

Wl 

0.00 

0.00 

0.00 

-1.60 

0.00 

0.00 

-Z74 

-3.00 

1.30 

w2 

-3.00 

3.00 

3.00 

3.00 

C 

E 

3.00 

1.84 

Z14 

w2 

-3.00 

3.00 

-3.00 

A 

D 

F 

3.00 

1.84 

-0.95 

w4 

-3.00 

-3.00 

-3.00 

B 

-3.00 

-3.00 

-3.00 

1.84 

-3.00 

w5 

0.00 

0.00 

3.00 

3.00 

0.00 

0.00 

3.00 

3.00 

3.00 

w6 

3.00 

3.00 

1.45 

1.75 

3.00 

3.00 

1.77 

1.69 

1.56 

W7 

3.00 

1.45 

1.75 

3.00 

3.00 

1.13 

1.51 

1.56 

Wg 

2.99 

0.00 

1.45 

1.75 

3.00 

1.05 

1.13 

1.51 

0.32 

W9 

0.00 

3.00 

0.82 

0.00 

0.00 

3.00 

3.00 

3.00 

W10 

1.18 

-3.00 

-1.45 

3.00 

3.00 

3.00 

1.77 

-1.51 

-1.25 

wn 

-1.18 

-3.00 

-1.45 

3.00 

3.00 

3.00 

-1.13 

1.51 

-1.25 

wl2 

-3.00 

-3.00 

-1.45 

3.00 

3.00 

1.05 

-1.13 

-1.69 

-1.09 

W13 

0.00 

0.00 

0.00 

0.82 

0.00 

0.00 

267 

3.00 

3.00 

w14 

-3.00 

3.00 

3.00 

3.00 

-0.78 

-3.00 

2.94 

-1.51 

1.43 

W15 

-3.00 

-3.00 

3.00 

3.00 

-0.78 

-3.00 

0.00 

-1.69 

1.43 

W16 

2.99 

-3.00 

3.00 

3.00 

-0.78 

1.05 

0.00 

1.51 

-0.44 

Classification 

good 

bad 

Percent  of  the  Weight  Space 

Covered  by  the  Collection 

95.4 

4.6 

Zone  of  the  class 

A  +  B  ~  -2.53;  C  +  D  -  2.47;  E  +  F  =  -3.01 
*  minimum  radius 
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Error 


Fig.  21  The  Profiles  of  Valleys  on  the  Error  Surface  of 
Network  8.4  Performing  the  3-Input  Parity  mapping. 


Radius 
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rT9i  Qog  scale) 


16 

Fig.  23  The  Product  of  All  the  Sixteen  Half-widths  ( j^0;  )  of  Valleys 
on  the  Error  Surface  of  Network  8.4  Performing  the  3-Input  Parity  Mapping. 


Radius 
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9.  WEIGHT  CONTAINMENT 

Our  study  of  some  sample  error  surfaces  have  shown  that  all  minima  on  these  surfaces  are 
at  infinity.  This  characteristics  is  not  desirable  because  actual  weights  are  typically  limited  to 
finite  values.  There  are  several  methods  to  produce  stable,  good  minima  at  the  finite  distance 
from  the  origin.  Besides  limiting  the  weight  space  to  be  a  compact  set,  we  can  move  the  loca¬ 
tions  of  all  minima  to  a  finite  distance  from  the  origin  by  using  weight  decay  or  by  setting  the 
goal  values  of  the  output  function  to  be  0.9  or  -0.9  rather  than  ±1.0.  The  update  formula  with 
weight  decay  is: 

w*+i  =  w„  -qV£>r(w)-5w, 

where  rt  and  5  are  some  constants.  [8]  recommends  5  for  the  XOR  task  to  be  0.0001. 

To  see  the  correlation  among  the  three  weight  containment  methods  mentioned  above,  we 
run  a  steepest  descent  from  each  minimum  on  the  error  surface  of  network  8.1  perrorming  the 
XOR  task.  These  three  weight  containments  are  used  to  prevent  the  steepest  descent  from  going 
too  far  from  the  origin.  Since  the  minima  of  class  MC  are  not  distinct  points,  we  chose  three 
separate  points:  the  point  where  w2  =  W3  =  -0.81,  the  point  where  w2  =  -1.62  and  W3  =  0.0,  and 
the  point  where  w2  =  0.0  and  w3  =  -1.62.  Geometrically,  the  first  point  is  a  symmetry  point  in  the 
middle  of  the  groove,  while  the  other  two  points  are  more  general,  asymmetric  solutions.  The 
data  gained  from  this  experiment  are  illustrated  in  Table  5.  This  data  indicates  that  the  results 
derived  from  all  three  weight  containments  are  consistent;  if  one  scheme  locates  a  good 
minimum,  the  other  schemes  will  also  locate  a  corresponding  good  minimum. 

In  order  to  compare  these  results  with  the  minima  found  by  limiting  the  weight  domain,  we 
also  terminated  the  steepest  descent  from  each  minimum  at  the  surface  of  a  cube  of  half-size  5 
(C-5)  and  also  with  a  sphere  of  radius  18  (R-18).  Examining  the  minima  at  R-18,  we  find  that  the 
ones  corresponding  to  points  MA  and  MB  have  error  values  of  0.000001  and  of  2.000014,  respec¬ 
tively.  This  data  is  consistent  with  the  fact  that  point  MA  is  in  the  good  valley  and  point  MB  is 
in  the  bad  valley,  and  thus  the  steepest  descents  starting  from  these  two  points  should  end  up  in 
different  minima. 

For  the  minimum  associated  with  point  MC,  we  find  it  approaches  the  minimum  associated 
with  point  MA  except  when  we  start  from  the  symmetry  point  MC  where  w2  =  w3.  The  trajec¬ 
tories  of  the  steepest  descents  strating  from  MA  and  MC  get  closer  to  each  other  the  further  we 
pursue  from  the  origin.  The  fact  that  they  approach  each  other  only  very  gradually  indicates  that 
the  associated  valley  has  a  very  broad  and  flat  floor  further  out  from  the  origin.  When  we  start 
from  the  symmetry  point  MC  with  w2  =  W3  =  -0.81,  we  slide  down  on  the  crest  of  the  ridge  to  a 
point  with  an  error  value  of  2.6673.  This  is  not  an  actual  minimum.  However,  since  the  weights 
at  this  point  which  is  one  of  the  symmetry  planes  of  the  error  surface  have  the  following  sym¬ 
metries:  w 2  =  W3,  w4  =  W7,  w 5  =  w8,  and  ws  =  W9,  the  trajectory  will  maintain  this  symmetry 
state  unless  we  perturb  it  deliberately,  e.g.,  by  eliminating  one  of  the  equalities  listed  above. 
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Table  5-1 

Comparison  of  the  Stopping  Point  of  Different  Learning  Algorithms 
Starting  from  Point  MA  in  Network  8.1. 


Method 

ST 

C-5 

SI 

WD 

G0.9 

Error 

0.0803 

0.0009 

0.000001 

0.0019 

2.6xl0~19 

w  i 

2.86 

4.97 

8.29 

4.87 

3.21 

-3.00 

-5.00 

-8.71 

-5.57 

-3.39 

W3 

-3.00 

-8.71 

-5.57 

-3.39 

W  4 

3.00 

5.00 

3.93 

2.54 

3.02 

W5 

3.00 

5.00 

4.26 

2.81 

3.11 

w6 

-3.00 

-5.00 

-4.26 

-2.81 

-3.11 

W7 

3.00 

5.00 

3.92 

2.54 

3.02 

Wj 

-3.00 

-5.00 

-4.26 

-2.81 

-3.11 

W  9 

3.00 

5.00 

4.26 

2.81 

3.11 

ST 

Starting  point  (on  cube  with  half-size  3) 

C-5 

Cube  with  half-size  5 

R-18 

Sphere  with  radius  18 

WD 

Weight  decay 

G0.9 

Goal  value  of  0.9  or  -0.9 

Table  5-2 

Comparison  of  the  Stopping  Point  of  Different  Learning  Algorithms 
Starting  from  Point  MB  in  Network  8.1. 


Method 

ST 

C-5 

R-18 

WD 

Error 

2.0969 

2.0028 

2.000014 

2.0028 

Wt 

0.00 

0.00 

0.00 

W2 

3.00 

5.16 

3.48 

W  3 

-3.00 

-5.16 

-3.48 

vv4 

3.00 

5.00 

8.12 

5.19 

tvs 

3.00 

5.00 

8.12 

5.19 

1.05 

1.08 

1.83 

1.61 

W-j 

3.00 

5.00 

8.12 

5.19 

wg 

3.00 

5.00 

8.12 

5.19 

Wg 

-1.05 

-1.08 

-1.83 

-1.61 

ST  Starting  point  (on  cube  with  half-size  3) 

C-5  Cube  with  half-size  5 

R-18  Sphere  with  radius  18 

WD  Weight  decay 

G0.9  Goal  value  of  0.9  or  -0.9 


0.00 

3.70 

-3  70 
8 
8 
0 
8 
8 
-0 
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Table  5-4 

Comparison  of  the  Stopping  Point  of  Different  Learning  Algorithms 
Starting  from  Point  MC  where  w;  =  -1.62  and  w  3  =  0.0  in  Network  8.1. 
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Table  5-5 

Comparison  of  the  Stopping  Point  of  Different  Learning  Algorithms 
Starting  from  Point  MC  where  W2  =  0.0  andn>3  =  -1.62  in  Network  8.1. 


Method 


ST 


C-5 


R-18 


WD 


G0.9 


Error 


2.8835 


0.0009 


1.3X10-6 


0.0075 


2.4xl0~19 


W  2 
W] 


0.82 

0.00 

-1.62 


4.97 

5.00 

-5.00 


8.17 

8.57 

-8.54 


4.87 

5.57 

-5.57 


3.09 

3.56 

-3.61 


w4 

w6 

w-j 

w8 

W  9 


3.00 

3.00 

-3.00 

3.00 

3.00 

-3.00 


-5.00 

5.00 

-5.00 

5.00 

5.00 

-5.00 


-3.72 

4.09 

-4.09 

4.44 

4.69 

-4.69 


-2.54 

2.81 

-2.81 

2.54 

2.81 

-2.81 


-1.77 

3.06 

-3.06 

4.21 

3.86 

-3.86 


ST  Starting  point  (on  cube  with  half-size  3) 

C-5  Cube  with  half-size  5 

R-18  Sphere  with  radius  18 

WD  Weight  decay 

G0.9  Goal  value  of  0.9  or  -0.9 
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10.  GENERAL  STRUCTURES  OF  ERROR  SURFACES  ASSOCIATED  WITH  THE 
PARITY  FUNCTIONS 

10.1.  Area  around  the  Origin 

For  an  error  surface  of  a  layered  feed-forward  network  performing  a  boolean  function,  the 
error  at  the  origin  has  a  moderate  bad  value.  Specifically,  for  a  boolean  function  with  g  input- 
output  patterns  and  ON  and  OFF  states  of  1  and  -1,  the  error  at  the  origin  equals  g .  This  value 
occurs,  because  at  the  origin,  where  all  the  weights  are  zero,  the  output  of  the  network  is  zero. 
Therefore,  the  square  of  the  difference  between  the  actual  output  and  the  goal  output  is  1,  and  the 
sum  of  the  squares  of  the  difference  is  g . 

For  an  error  surface  derived  from  the  parity  function,  the  region  around  the  origin  is  very 
flat,  because  the  first  derivatives  at  the  origin  are  zero  in  all  directions.  To  show  this,  we  begin  by 
approximating  the  output  of  the  network  in  a  region  close  to  the  origin  by 

O(ip)  —  '2"vvi  — 


For  a  parity  function  of  m  inputs,  the  function  consists  of  2m  input-output  patterns.  Denoting  i 
input-output  pattern  by  ( i; ,  gL  ),  the  error  near  the  origin  can  be  approximated  by 

Err  =  2^  ( gi  -  —  )2- 


We  note  that  the  goal  output  of  the  function  is  either  1  or  -1,  and  the  number  of  the  input-output 
patterns  with  output  of  1  and  -1  are  the  same  and  equal  2n~l.  Hence,  the  error  function  becomes 

Err  =  2m-1(l-yW1-^-w?  )2  +  2m-1(-l-yWi-^rwP  )2  (24) 

Simplifying  this  expression,  we  have 

Err  =  2m  -  2 m~2w  }  +  e(w  ? ). 

where  e( )  is  a  polynomial  function  with  a  smallest  degree  of  3.  Taking  the  first  derivative  with 
respect  to  w  |,  we  have 

d(T>rl  =  +  e'(w?). 

At  the  origin,  where  wt  =  0,  the  first  derivative  of  the  error  with  respect  to  is  zero.  For  the 
first  derivatives  with  respect  to  the  other  weights,  they  must  also  be  zero  because  the  error  func¬ 
tion  (24)  is  a  function  of  w  i  only. 

10.2.  General  Structure  of  Error  Surfaces  Associated  with  the  Parity  Functions 
The  following  characteristics  appeared  in  all  the  cases  we  studied: 

1  Symmetries  exist  in  an  error  surface.  Any  feature  of  the  error  surface  must  be  thought 
of  as  a  class  of  equivalent  configurations. 
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2  At  the  origin  an  error  surface  has  a  moderately  bad  error.  Close  to  the  origin  the  sur¬ 
face  is  quite  level  and  then  turns  upward  in  every  direction  except  for  a  few  directions 
where  shallow  valleys  run  out  from  the  origin.  The  floors  of  these  valleys  get  deeper 
and  approach  zero.  One  must  be  aware  that  the  "origin"  of  weight  space  is  an  artifact. 
In  other  problem  where  a  switching  function  is  an  odd  function  plus  a  constant  (which 
is  typical  when  ON  =  1  and  OFF  =  0),  all  the  features  of  the  error  surface  will  be 
modified  according  to  the  constant  term. 

3  Several  additional  valleys  originate  further  away  from  the  origin  and  then  radiate  out. 
Some  of  these  valley  floors  approach  zero  as  they  ran  further  out. 

4  There  are  no  actual  minima  except  at  infinity.  This  characteristics  comes  from  the 
fact  that  the  sigmoid  function  used  in  our  study  only  asymptotically  approaches  ±1 
only  at  infinity.  However,  we  can  create  minima  in  a  bounded  region  by  restricting 
the  weight  space  to  a  compact  set,  by  weight  decay,  by  setting  the  the  goal  values  to 
be  -0.9  or  0.9,  by  using  a  hard  limiter  (Fig.  2)  as  a  switching  function,  or  by  several 
other  methods. 

10.3.  Effect  of  the  Number  of  Hidden  Neurons 

The  number  of  hidden  neurons  in  a  layered  feed-forward  network,  affects  the  potential  and 
the  learning  efficiency  of  the  network  in  performing  a  desired  function.  A  network  can  perform 
an  assigned  function  if  there  is  a  minimum  on  the  error  surface  deep  enough  to  constitute  a  good 
solution  which  corresponds  to  a  set  of  weights  that  allows  the  network  to  perform  an  assigned 
function.  Comparing  the  errors  at  the  deepest  minima  on  the  error  surfaces  of  network  8.1,  8.2 
and  8.3  in  the  XOR  task  (Table  6),  we  see  that  the  error  surface  of  a  network  with  more  hidden 
neurons  has  a  smaller  error  at  its  deepest  minimum  than  the  error  surface  of  a  network  with  fewer 
hidden  neurons.  In  effect,  a  network  with  more  hidden  neurons  can  perform  the  XOR  function 
better  than  a  network  with  fewer.  This  statement  is  also  true  for  other  functions.  We  therefore 
conclude  that  the  accuracy  with  which  a  network  can  approximate  a  given  function  is  higher  for 
networks  with  more  hidden  neurons. 


Table  6 

The  Deepest  Minimum  on  the  Error  Surfaces  of 

Networks  8.1,  8.2  and  8.3  Performing  the  XOR  Function. 

Network 

8.1 

8.2 

8.3 

Number 

of 

Hidden  Neurons 

2 

3 

4 

The  Error 
of 

the  Deepest  Minimum 

0.0803 

0.0605 

0.0005 

To  enable  a  layered  feed-forward  network  to  perform  a  desired  function,  a  proper  set  of 
weights  must  be  assigned  to  the  network.  One  algorithm  for  finding  such  a  set  is 
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back -propagation  [2],  This  algorithm  is  an  iterative  process  that,  for  infinitely  small  step  size, 
approximates  steepest  descent  in  the  weight  space  of  a  network.  If  the  minimum  found  has  a 
small  enough  error,  the  network  has  learned  to  perform  the  assigned  function,  otherwise,  the  net¬ 
work  is  stuck  and  has  failed  to  leam. 

If  the  starting  point  of  the  back-propagation  process  is  chosen  randomly  in  weight  space,  the 
chance  that  back-propagation  will  locate  a  particular  minimum  is  proportional  to  the  percent  of 
the  weight  space  covered  by  the  collection  zone  of  that  minimum.  That  is,  the  success  rate  of 
back-propagation  with  a  random  starting  point  in  training  a  network  is  proportional  to  the  percent 
of  the  weight  space  covered  by  the  collection  zones  of  all  the  good  minima.  This  success  rate  is  a 
measure  of  how  likely  it  is  that  a  network  will  leam  to  perform  an  assigned  task  when  using  a 
steepest  descent  algorithm. 

Comparing  the  fraction  of  the  weight  spaces  covered  by  the  collection  zones  of  all  the  good 
minima  in  networks  8.1,  8.2,  and  8.3  (Table  7),  we  see  that  the  percentages  increase  from  net¬ 
work  8.1  to  8.3.  These  examples  indicate  that  the  fraction  of  weight  space  covered  by  all  the 
good  minima  is  higher  in  a  network  with  more  hidden  neurons.  We  therefore  conclude  that  the 
learning  success  probability  of  a  network  can  be  increased  by  adding  hidden  neurons  to  the  net¬ 
work. 


Table  7 

Percent  of  the  Weight  Space  Covered  by 
the  Collection  Zones  of  All  the  Good  Minima 
in  Networks  8.1,  8.2  and  8 3. 

Network 

8.1 

8.2 

8.3 

Number 

of 

Hidden  Neurons 

2 

3 

4 

Percent  of  the  Weight  Space 
Covered  by 

the  Good  Collection  Zone 

60.3 

93.9 

99.3 
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11.  METHODS  FOR  LOCATING  GOOD  MINIMA 

Considering  the  information  obtained  from  this  study  of  several  error  surfaces,  we  suggest 
three  methods  for  increasing  the  chance  of  finding  a  good  minimum  in  a  bounded  weight  space 
while  restricting  ourselves  to  gradient  descent  on  the  error  surface.  The  success  rate  can  be 
further  increased  by  various  dynamic  methods,  e.g.,  by  taking  large  steps  of  varying  size,  by 
using  the  back-propagation  with  a  momentum  term,  by  adding  noise  to  the  system,  or  in  several 
other  ways.  This  issue  is  beyond  the  scope  of  this  paper,  which  considers  only  the  static  nature  of 
the  error  surface. 

A  first  method  exploits  the  fact  found  in  the  parity  networks  that  bad  valleys  usually  start 
further  out  from  the  origin,  and  that  the  valleys  starting  near  the  origin  lead  to  good  minima.  In 
this  approach,  we  first  construct  a  small  sphere  centered  at  the  origin.  We  then  locate  a  minimum 
on  this  sphere  with  the  method  discussed  in  section  6.3.  From  there  we  run  a  steepest  descent  to 
a  minimum  in  the  weight  space.  If  we  define  the  radius  of  the  searched  sphere  to  be  less  than  the 
distance  between  the  origin  and  the  starting  point  of  any  bad  valleys,  we  ensure  that  the  steepest 
descent  starting  from  the  minimum  on  the  sphere  will  definitely  lead  to  a  good  minimum. 

However,  we  usually  do  not  have  priori  knowledge  about  bad  valleys.  Also,  finding  the 
minimum  on  a  sphere  is  complex  and  expensive  to  compute.  We  therefore  recommend  a  simpler 
version  of  the  above  method.  This  method  uses  steepest  descent  starting  from  a  point  randomly 
chosen  near  the  origin.  This  will  significantly  increase  the  chance  of  sliding  first  into  a  good  val¬ 
ley.  Starting  steepest  descent  near  the  origin  also  decreases  the  chance  of  being  trapped  in  bad 
minima  whose  collection  zones  are  small  and  mostly  concentrated  in  the  regions  near  the  boun¬ 
daries  of  the  weight  domain.  Examples  of  such  minima  are  minimum  MC  in  network  8.1, 
minimum  MG  in  network  8.2,  and  minimum  MO  in  network  8.3. 

We  used  some  statistical  probing  to  verify  the  statement  above.  For  each  error  surfaces  stu¬ 
died  earlier,  we  ran  steepest  descents  from  a  thousand  starting  points  randomly  distributed  in  a 
cube  of  a  given  half-size.  The  percentages  of  the  steepest  descents  stopping  at  each  minimum  in 
a  weight  domain  corresponding  to  a  cube  of  half-size  3.0  is  given  in  Table  B-l  through  B-4  in 
appendix  B.  Comparing  these  data  with  those  in  Table  1,  2  and  3,  we  find  that  the  steepest  des¬ 
cents  with  starting  points  distributed  in  a  cube  of  half-size  1.5  has  a  better  chance  of  locating  a 
good  minimum  than  those  with  starting  points  distributed  in  a  cube  of  half-size  3  (see  Table  8). 
Lippeman(3]  also  suggests  that  the  starting  weights  of  back-propagation  should  be  small. 

The  restriction  of  a  starting  point  near  the  origin  has  one  disadvantage.  Since  the  gradient 
of  an  error  function  in  an  area  around  the  origin  is  small,  a  steepest  descent  must  ran  many  itera¬ 
tions  to  get  out  of  that  area  and  so  it  slows  down  the  convergence  rate. 
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Table  8 

The  Comparison  of  Two  Classes  of  Steepest  Descents  in  Searching  the  Minima 

A)  The  Starting  Points  of  the  Processes  are  randomly  distributed  in  a  cube  of  half-size  1.5. 

B)  The  Starting  Points  of  the  Processes  are  randomly  distributed  in  a  cube  of  half-size  3.0. 


Network 

8.1 

8.2 

8.3 

Percent  of  Trajectories 
in  Class  A 

Approaching  good  Minima 

74.1 

99.2 

100 

Percent  of  Trajectories 
in  Class  B 

Approaching  good  Minima 

60.3 

93.9 

99.3 

The  third  method  is  useful  where  the  weight  space  is  bounded  by  a  cube.  This  method 
assumes  that  all  minima  lie  close  to  comers  of  the  cube,  and  that  large  gradients  exist  in  the 
region  further  out  from  the  origin.  Exploiting  these  facts,  we  start  our  steepest  descents  near  the 
comers  of  the  cube.  We  may  need  to  run  several  steepest  descents  before  we  can  locate  a  good 
minimum.  However,  we  will  still  be  able  to  find  a  good  minimum  rather  quickly,  because  each 
run  will  be  very  fast. 
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12.  CONCLUSION 

Layered  feed-forward  networks  have  recently  received  renewed  interest  because  of  their 
ability  to  represent  complex  mappings.  This  ability  is  completely  described  by  the  corresponding 
error  surface  .  Such  an  error  surface  can  be  obtained  by  plotting  the  maximum  error  or  the  sum 
of  the  squares  of  the  errors,  where  the  individual  errors  are  the  deviation  of  the  network’s  actual 
output  from  its  desired  goal  function.  The  error  surfaces  of  a  network  with  W  weights  are  W- 
dimensional  surfaces  embedded  in  a  W +1 -dimensional  space  and  defined  over  a  weight  space  ,  a 
W -dimensional  domain  given  by  the  ranges  of  the  W  weights.  The  error  surface  can  be  viewed 
as  a  measure  of  the  deviation  of  a  desired  mapping  from  a  mapping  represented  by  the  network 
for  all  possible  combinations  of  weight  strengths.  Understanding  the  structure  of  the  error  surface 
will  help  us  answer  many  questions  about  the  network’s  properties. 

In  this  report,  we  have  chosen  to  study  a  class  of  simple  three-layer  networks  with  complete 
connectivity  between  adjacent  layers.  We  have  found  that  error  surfaces  of  these  networks  exhi¬ 
bit  two  kinds  of  symmetries.  The  first  symmetry  occurs,  because  the  signs  of  all  the  weights  con¬ 
nected  to  any  hidden  neuron  can  be  inverted  simultaneously  without  changing  the  behavior  of  the 
networks.  The  second  symmetry  occurs,  because  the  connection  patterns  of  all  neurons  in  the 
hidden  layer  are  identical;  thus  any  permutation  of  these  neuron  roles  leads  to  the  same  network 
behavior.  When  these  networks  perform  the  Exclusive-Or  or  the  parity  functions,  their  error  sur¬ 
faces  have  two  other  symmetries.  The  first  arises  from  arbitrary  ordering  of  the  input  neurons, 
and  the  second  arises  from  the  fact  that  simultaneous  complementations  of  any  two  inputs  do  not 
change  the  definition  of  the  functions. 

Knowing  these  symmetries,  we  can  define  the  region  of  a  weight  space  such  that  the  error 
surface  defined  over  this  chosen  fraction  possesses  every  aspect  of  the  entire  error  surface.  This 
region,  referred  to  as  a  representative  sector ,  can  be  located  by  recursively  subdividing  the 
weight  space  by  the  implied  hyper-planes.  This  representative  sector  is  important,  because  it 
reduces  the  scope  of  the  visualization  task,  leaving  only  the  region  in  the  representative  sector  to 
be  explored  in  order  to  gain  full  understanding  of  the  behavior  of  the  network. 

We  have  studied  in  detail  a  few  small  networks  performing  the  Exclusive-OR  function  of  2 
or  3  inputs.  For  each  case,  we  found  that  the  surface  at  the  origin  is  a  saddle  point,  and  the  area 
around  the  origin  is  very  level.  By  limiting  the  weight  space  to  the  region  within  a  cube  with 
half-size  3,  we  then  employed  steepest  descent  with  random  starting  points  to  search  for  all 
minima  on  the  error  surface.  For  each  minimum  found,  we  determined  its  location,  its  error 
value,  and  the  characteristics  of  the  associated  valley.  We  have  also  found  that  the  minima  found 
within  a  restricted  weight  space  are  consistent  with  those  found  by  employing  weight  decay  or  by 
using  goal  values  of  0.9  or  -0.9  to  limit  the  range  of  all  the  weights. 

Comparing  the  error  surfaces  of  the  sample  networks,  we  found  that  the  accuracy  with 
which  a  network  can  approximate  a  given  function  and  the  learning  success  probability  are  higher 
for  a  network  with  more  hidden  neurons.  Finally,  three  methods  for  increasing  the  chance  of 
finding  a  good  minimum  when  using  gradient  descent  emerged  from  the  understanding  of  the 
structure  of  the  error  surface  gained  in  this  study. 
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APPENDIX  A  Symmetry  Operations 


Table  A-l 

Symmetry  Operations  Associated  with  the  Error  Surface 

of 

Network  8.1  Performing  the  XOR  mapping 


Notation 


Description 


w2  - 
w3  ■ 

w2  < 

w5  • 

w5 

w6 


■  -w2,  w4  • 

•  -w3,  w7  • 

— >  w3,  w4  <- 

— >  w6,  w8  <r 

— >  -w5,  w8  < 
-w6,  w9  ■ 


'  -w4,  w5 
•  -w7.  w8 


.  -w5,  w6  ■ 
.  -w8,  w9  ■ 


.  -w6 
.  -w9. 


•  w7,  w5  <- 

.  w9. 

-w8,  wl 
-»  -w9,  wl 


•  w8,  w6  > 


-wl,  w2  < 
-wl,  w2  < 


•  w9, 


■  -w2,  w3  • 
•  -w2,  w3  ■ 


•  -w3. 

•  -w3. 


Number  of  Equivalent  Configurations  64 


Representative  Sector 


w4  >  w7  >  0;  w5  >  w6  >  0 
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Table  A-2 

Symmetry  Operations  Associated  with  the  Error  Surface 

of 

Network  8.2  Performing  the  XOR  mapping 


Notation 


opHxl32 

opHx213 

opHx231 

opHx312 

opHx321 


w2  -*  -w2,  w5  - 
w3  -*  -w3,  w8  - 
w4  -w4,  wll 


.  -w5,  w6  — 
•  -w8,  w9  — 
»  -wll,  wl2 


-w6,  w7  — >  -w7 
.  -w9,  wlO  — >  -wtO 
->  -wl2,  wl3  — »  -wl3 


w3  * — >  w4,  w8  < - >  wll,  w9  * - >  wl2,  wlO  < - >  wl3 

w3  < _ ►  w4,  w8  * — *  wll,  w9  < — >  wl2,  wlO  < - »  w!3 

w3  ->  w2,  w4  -»  w3,  w2  -►  w4,  w8  ->  w5,  w9  -»  w6,  wlO  — *  w7, 
wll  w8,  wl2  -»  w9,  wl3  -»  wlO,  w5  -»  wll,  w6  ->  wl2,  w7  -» wl3 
w4  — >  w2,  w2  — >  w3,  w3  — >  w4,  wl  1  — >  w5,  wl2  -»  w6,  wl3  ->  w7, 
w5  _>  w8,  w6  — >  w9,  w7  -»  wlO,  w8  -»  wll,  w9  -*  wl2.  wlO  -»  wl3 
w2  < - >  w4,  w5  ♦ — >  wl  1,  w6  < — »  wl2,  w7  < - >  wl3 


.  w7,  w9 


■  wlO,  wl2  • 


wl  — >  -wl,  w2  — 
w6  — »  -w6,  w9  - 
wl  — >  -wl,  w2  - 
w7  — >  -w7,  wlO 


.  -w2,  w3  — >  -w3,  w4  ->  -w4, 
.  -w9,  wl2  — >  -wl2 
•  -w2,  w3  — >  -w3,  w4  — » -w4, 
->  -wlO,  wl3  — >  -wl3 


Number  of  Equivalent  Configurations  384 


Representative  Sector 
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Table  A-3 

Symmetry  Operations  Associated  with  the  Error  Surface 
of 

Network  8J  Performing  the  XOD  mapping 

Notation 

Description 

opHl 

w2  — >  -w2,  w6  — > -w6,  w7  — >  -w7,  w8  — >  -w8 

opH2 

w3 -4 -w3,  w9  — 4-w9,  wlO  — 4 -wlO,  wll  — 4 -wll 

opH3 

w4  —4  -w4,  wl2  — 4  -wl2,  wl3  —4  -wl3,  wl4  — »  -w!4 

opH4 

w3  — 4 -w5,  wl5  —4  -wl5,  wl6  — 4  -wl6,  wl7  — >  -w!7 

opHxl243 

w4  < — 4  w5,  wl2  < — 4  wl5,  w!3  < — 4  wl6,  wl4  < — 4  wl7 

opHxl324 

w3  < _ 4  w4,  w9  < — 4  wl2,  wlO  « >  wl3,  wll  < >  wl4 

opHxl342 

w4  —4  w3,  w5  —4  w4,  w3  —4  w5,  wl2  — >  w9,  wl3  — »  wlO,  w!4  — 4  wll, 

W15  _>  wi2,  wl6  ->  wl3,  wl7  -4  wl4,  w9  -4  wl5,  wlO  -4  wl6,  wll  -4  wl7 

opHxl423 

w5  _>  w3,  w3  -4  w4,  w4  -4  w5,  wl5  -4  w9,  wl6  ->  wlO,  wl7  -4  wll, 
w9  -4  wl2,  wlO  -4  wl3,  wll  -4  wl4,  wl2  -4  wl5,  wl3  -4  wl6,  wl4  -4  wl7 

opHxl432 

w3  < — 4  w5,  w9  < — 4  wl5,  wlO  < — 4  wl6,  wll  < - >  wl7 

opHx2134 

w2  < - >  w3,  w6  < - >  w9,  w7  < — »  wlO,  w8  < - >  wll 

opHx2143 

w3  ->  w2,  w2  -»  w3,  w5  -»  w4,  w4  ->  w5,  w9  -»  w6,  wlO  ->  w7, 
wll  — »  w8,  w6  — »  w9,  w7  — >  wlO,  w8  — >  wll,  wl5  — >  wl2, 
wl6  -»  wl3,  wl7  ->  wl4,  wl2  ->  wl5,  wl3  ->  wl6,  wl4  ->  wl7 

opHx2314 

w3  ->  w2,  w4  -»  w3,  w2  w4,  w9  ->  w6,  wlO  ->  w7,  wll  w8, 
wl2  -»  w9,  wl3  wlO,  wl4  ->  wll,  w6  wl2,  w7  wl3,  w8  -»  wl4 

opHx2341 

w3  w2,  w4  — >  w3,  w5  -4  w4,  w2  -4  w5,  w9  -4  w6,  wlO  -4  w7, 
wll  _>  w8,  wl2  -»  w9,  wl3  -4  wlO,  wl4  —*  wll,  wl5  -4  wl2, 
wl6  -4  wl3,  wl7  -4  wl4,  w6  ->  wl5,  w7  -4  wl6,  w8  -4  wl7 

opHx2413 

w3  —4  w2,  w5  —4  w3,  w2  —4  w4,  w4  —4  w5,  w9  —4  w6,  wlO  —4  w7, 
wll  —4  w8,  wl5  —4  w9,  wl6  —4  wlO,  wl7  —4  wll,  w6  —4  wl2, 
w7  -4  wl3,  w8  -4  wl4,  wl2  ->  wl5,  wl3  -4  wl6,  wl4  -4  wl7 

opHx2431 

w3  —4  w2,  w5  —4  w3,  w2  —4  w5,  w9  —4  w6,  wlO  —>  w7,  wll  —4  w8, 
wl5  -4  w9,  wl6  -4  wlO,  wl7  -4  wll,  w6  wl5,  w7  -4  wl6,  w8  -4  wl7 

opHx3124 

w4  -4  w2,  w2  -4  w3,  w3  -4  w4,  wl2  -4  w6,  wl3  -4  w7,  wl4  -4  w8, 
w6  -4  w9,  w7  -4  wlO,  w8  -4  wll,  w9  -4  wl2,  wlO  -4  wl3,  wll  -4  wl4 

opHx3142 

w4  —4  w2,  w2  —4  w3,  w5  —>  w4,  w3  —4  w5,  wl2  — 4  w6,  wl3  — 4  w7, 
wl4  —4  w8,  w6  —4  w9,  w7  —4  wlO,  w8  -4  wll,  wl5  —4  wl2, 
wl6  —4  wl3,  wl7  —4  wl4,  w9  —4  wl5,  wlO  —4  wl6,  wll  -4  wl7 

opHx3214 

w2  i - 4  w4,  w6  < - 4  wl2,  w7  « - 4  wl3,  w8  < - 4  wl4 

opHx3241 

w4  -4  w2,  w5  -4  w4,  w2  -4  w5,  wl2  -4  w6,  wl3  -4  w7,  wl4  -4  w8, 
wl5  -4  wl2,  wl6  -4  wl3,  wl7  -4  wl4,  w6  -4  wl5,  w7  -4  wl6,  w8  -4  wl7 

opHx3412 

w4  -4  w2,  w5  -4  w3,  w2  -4  w4,  w3  -4  w5,  wl2  -4  w6,  wl3  -4  w7, 
wl4  -4  w8,  wl5  -4  w9,  wl6  -4  wlO,  wl7  -4  wll,  w6  -4  wl2, 
w7  -4  wl3,  w8  -4  wl4,  w9  -4  wl5,  wlO  -4  wl6,  wl  1  -4  wl7 

opHx3421 

w4  -4  w2,  w5  — 4  w3,  w3  -4  w4,  w2  -4  w5,  wl2  -4  w6,  wl3  -4  w7, 
wl4  -4  w8,  wl5  -4  w9,  wl6  -4  wlO,  w!7  -4  wll,  w9  -4  wl2, 
wlO  -4  wl3,  wll  —4  wl4,  w6  -4  wl5,  w7  -4  wl6,  w8  ->  wl7 
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Table  A-3  Continued 

Symmetry  Operations  Associated  with  the  Error  Surface 

of 

Network  8.3  Performing  the  XOR  mapping 


Notation 


opHx4123 


opHx4132 

opHx4213 

opHx4231 

opHx4312 


opHx4321 


>  w2,  w2  — > ' 
-4  w8,  w6  — 4 
-4  wl3,  wll 

►  w2,  w2  — 4  ' 
4  w9,  w7  — >  ■ 
4  w2,  w2  — > 

4  wl2,  w7  — i 
— >  w5,  w6  « 

♦  w2,  w4  — 4 
-4  w8,  wl2  - 

♦  wl3,  w8  — ■ 
■>  w2,  w4  -4 
— »  w8,  wl2  • 
-4  Wl3,  wll 


w3,  w3  -4  w4,  w4  -4  w5,  wl5  -4  w6,  wl6  -4  w7, 

4  w9,  w7  —4  wlO,  w8  —>  wll.  w9  —>  wl2, 

-4  wl4,  wl2  — 4  wl5,  wl3  —>  wl6,  wl4  —4  wl7 
w3,  w3  —4  w5,  wl5  —4  w6,  w!6  -4  w7,  wl7  —4  w8, 
wlO,  w8  -4  wll,  w9  -4  wl5,  wlO  -4  wl6,  wll  -4  wl7 
w4,  w4  —4  w5,  wl5  — 4  w6,  wl6  — 4  w7,  wl7  —4  w8, 

4  wl3,  w8  -4  wl4,  wl2  -4  wl5,  wl3  -4  wl6,  wl4  -4  wl7 

< — 4  wl5,  w7  < - 4  wl6,  w8  < - 4  wl7 

w3,  w2  -4  w4,  w3  -4  w5,  wl5  -4  w6,  wl6  -4  w7, 

-4  w9.  wl3  -4  wlO,  wl4  -4  wll.  w6  -4  wl2, 

4  wl4,  w9  —4  wl5,  wlO  —4  wl6,  wll  —4  wl7 
w3.  w3  —4  w4,  w2  —4  w5,  wl5  —4  w6,  wl6  —4  w7, 

-4  w9,  wl3  -4  wlO,  wl4  -4  wll.  w9  -4  wl2, 

1  _4  wl4,  w6  -4  wl5,  w7  -4  wl6,  w8  -4  wl7 


•  w8,  wlO  • 


■  wll,  wl3 


.  wl4,  wl6  < 


wl  —4  -wl,  w2  —4  -w2,  w3  —4  -w3,  w4  —4  -w4,  w5  —4  -w5, 
w7  -4  -w7,  w  10  -4  -wlO,  wl3  -4  -wl3,  wl6  -4  -wl6 
wl  — 4  -wl,  w2  — 4  -w2,  w3  — 4  -w3,  w4  — 4  -w4,  w5  — 4  -w5, 
w8  — 4  -w8,  wll  —4  -wll,  wl4  —4  -wl4,  wl7  —4  -wl7 


Number  of  Equivalent  Configurations  3072 


Representative  Sector 


w6  >  w9  >  wl2  >  wl5  >  0;  w7>w8>0 
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Table  A-4 

Symmetry  Operations  Associated  with  the  Error  Surface 

of 


Network  8.4  Performing  the  3-input  Parity  Mapping 

Notation  Description 


opHl 

opH2 

opH3 

opHxl32 

opHx213 

opHx231 


opHx312 


opHx321 

opIxl32 

opIx213 

op!x231 

op!x312 

opIx321 

opll 

opI2 

opI3 


w2  — > -w2,  w5  — » -wo,  w6  — 4 -w6,  w7  — »-w7,  w8  — *  -w8 
w3  — >  -w3,  w9  -4  -w9,  wlO  -4  -wlO,  wll  -»  -wll,  wl2  —4  -wl2 
w4  .w4,  wl3  -4  -wl3,  wl4  -4  -wl4,  w!5  -4  -wl5,  wl6  -4  -wl6 

w3  < _ >  w4,  w9  * — >  wl3,  wlO  < — 4  w!4,  wll  < - >  wl5,  wl2  « - 4  wl6 

w2  < _ 4  w3,  w5  < — 4  w9,  w6  < — 4  wlO,  w7  < - 4  wll,  w8  « - 4  wl2 

w3  _>  w2,  w4  — >  w3,  w2  -4  w4,  w9  ->  w5,  wlO  —4  w6, 
wll  _>  w7,  wl2  -4  w8,  wl3  -4  w9,  wl4  -4  wlO,  w!5  -4  wll, 
wl6  -4  wlZ  w5  — >  wl3,  w6  ->  wl4,  w7  -4  wl5,  w8  -4  wl6 
w4  w2,  w2  —4  w3,  w3  -4  w4,  wl3  —4  w5,  wl4  —4  w6, 

W15  _»  w7,  wl6  -4  w8,  w5  ->  w9,  w6  —>  wlO,  w7  -4  wll, 

w8  -4  wl2,  w9  —4  wl3,  wlO  — 4  wl4,  wll  -4  wl5,  wl2  -4  wl6 

W2  i — 4  w4,  w5  < — 4  wl3,  w6  < — 4  wl4,  w7  « — 4  wl5,  w8  < — 4  wl6 

w7  < - 4  w8,  wl  1  < — 4  wl2,  wl5  < - 4  wl6 

w6  < - 4  w7,  wlO  < - 4  wll,  wl4  < - 4  wl5 

w7  -4  w6,  w8  -4  w7,  w6  -4  w8,  wl  1  -4  wlO,  wl2  -4  wl  1, 

WlO  -4  W 1 2,  wl5  -4  wl4,  wl6  -4  wl5,  wl4  -4  wl6, 
w8  w6,  w6  -4  w7,  w7  w8,  wl2  —4  wlO,  wlO  —>  wll, 
wll  -4  W12,  wl6  -4  wl4,  wl4  -4  wl5,  wl5  -4  wl6, 
w6  « - 4  w8,  wlO  < - 4  wl2,  wl4  < - 4  wl6 

wl  —4  -wl,  w2  —4  -w2,  w3  —>  -w3,  w4  —4  -w4 
w6  —4  -w6,  wlO  -4  -W 10,  wl4  —4  -wl4 
wl  — 4  -wl,  w2  — 4  -w2,  w3  —4  -w3,  w4  —4  -w4 
w7  —4  -w7,  wll  —4  -wll,  wl5  —4  -wl5 
wl  —4  -wl,  w2  —4  -w2,  w3  —4  -w3,  w4  —4  -w4 
w8  —4  -w8,  wl2  —4  -wl2,  wl6  —4  -wl6 


Number  of  Equivalent  Configurations  2304 


Representative  Sector 


w5  >  w9  >  wl3  >  0;  w6  >  w7  >  w8  >  0 
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APPENDIX  B  Influence  of  Starting  Point  on  Gradient  Descent 


Table  B-l 

The  Comparison  of  Two  Classes  of  Steepest  Descents  in  Searching 
the  Minima  in  the  Error  Surface  of  Network  8.1  Performing  the  XOR  Mapping. 

A)  The  Starting  Points  of  the  Processes  are  randomly  distributed  in  [—1 .5 ,  1.5] 

B)  The  Starting  Points  of  the  Processes  are  randomly  distributed  in  [—0.0,  3.0] 

Identifier 

MA 

MB 

MC 

Error 

0.0803 

2.0969 

2.8835 

Percent  of  Trajectories 
in  Class  A 

Approaching  the  Point 

74.1 

25.6 

0.3 

Percent  of  Trajectories 
in  Class  B 

Approaching  the  Point 

60.3 

37.5 

2.2 

Table  B-2 

The  Comparison  of  Two  Classes  of  Steepest  Descents  in  Searching 
the  Minima  in  the  Error  Surface  of  Network  8.2  Performing  the  XOR  Mapping.^ 

A)  The  Starting  Points  of  the  Processes  are  randomly  distributed  in  [-1.5, 1.5] » 

B)  The  Starting  Points  of  the  Processes  are  randomly  distributed  in  [-3.0,  3.0] 1-3 


Identifier 

MD 

ME 

MF 

MG 

Error 

0.0605 

0.0665 

0.0803 

2.0965 

Percent  of  Trajectories 
in  Class  A 

Approaching  the  Point 

74.8 

20.7 

0.8 

Percent  of  Trajectories 
in  Class  B 

Approaching  the  Point 

67.1 

25.3 

1.5 

6.1 

